Apologies as I am new to using Keras and working with LSTM predictions in general. I am writing code that takes in a CSV file whose columns are float or int values which are related in some way, uses a Keras LSTM model to train against these columns, and attempts to predict one of the columns as an output. I am following this guide:
https://machinelearningmastery.com/multivariate-time-series-forecasting-lstms-keras/
In the example, the relevant predicted column is the amount of air pollution, and all other values are used to predict this value. Adapting the example code for my CSV seems straightforward: I change the size of the training data and the number of columns appropriately.
My issue is that I don't understand why the example code is outputting predicted values for the "pollution" column and not some other column. I could just make the column I want to predict the second column in my formatted input CSV, but I would like to understand what is actually happening in the example as much as possible. Looking at the documentation for Model.predict(), it says the input value can be a list of arrays if the model has multiple inputs, but the return value is only described as "numpy array(s) of predictions", and it doesn't specify how I can make it return "arrays" versus an array. When I print out the result of this function, I only get an array of predictions for the pollution variable, so it seems like that column is selected somewhere before this point.
How can I change which column is returned by predict()?
Which column is returned by predict() depends on what you select as your output data (y). When the author preprocessed their data, they made the current pollution the last column of their dataset. Then, when selecting an output, y, to evaluate the model, they ran these lines of code:
# split into input and outputs
train_X, train_y = train[:, :-1], train[:, -1]
test_X, test_y = test[:, :-1], test[:, -1]
The input (X) arrays include all rows and every column except the last, as denoted by the slice indices, whereas the output (y) arrays include all rows but only the last column, which is the pollution variable.
When the model is training, it is trying to use the inputs, in this case the previous timestep inputs, to accurately predict the output, which in this case is the pollution at the current time. Therefore, when the model makes predictions, it will use the mapping it learned between the two to predict pollution.
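For example, purely to illustrate the slicing, if the column you wanted to predict were the first column of your prepared array instead of the last, a minimal sketch of the split (assuming the same train and test arrays as in the tutorial) would be:
# split into inputs and outputs, using column 0 as the target instead of the last column
train_X, train_y = train[:, 1:], train[:, 0]
test_X, test_y = test[:, 1:], test[:, 0]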
So, in summary, select the column that you want your model to predict as the train_y and test_y datasets! Hope this helps!
I am using the xgboost XGBRegressor to train on a data of 20 input dimensions:
model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=20)
model.fit(trainX, trainy, verbose=False)
trainX is 2000 x 19, and trainy is 2000 x 1.
In other words, I am using the 19 dimensions of trainX to predict the 20th dimension (the single dimension of trainy) during training.
When I am making a prediction:
yhat = model.predict(x_input)
x_input has to be 19 dimensions.
I am wondering if there is a way to keep training on the 19 dimensions to predict the 20th dimension, but at prediction time supply an x_input that has only 4 of those dimensions to predict the 20th dimension. It is kind of like transfer learning to a different input dimension.
Does xgboost support such a feature? I tried just filling x_input's other dimensions with None, but that yields terrible prediction results.
Fundamentally, you're training your model with a dense dataset (19/19 feature values), and are now wondering if you're allowed to make predictions with a sparse dataset (4/19 feature values).
Does xgboost support such a feature?
Yes, it is technically possible with XGBoost, because XGBoost will treat the absent 15/19 feature values as missing. It would not be possible with some other ML frameworks (such as Scikit-Learn) that do not work with sparse input by default.
Alternatively, you can make your XGBoost model explicitly "missing-value-proof" by assembling a pipeline which contains feature imputation step(s).
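For example, a rough sketch of that pipeline idea, assuming scikit-learn is available and reusing the question's trainX/trainy:
# impute missing feature values, then hand the filled-in rows to XGBoost
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
import xgboost as xgb

pipeline = make_pipeline(
    SimpleImputer(strategy='mean'),   # learns per-feature means from the dense training data
    xgb.XGBRegressor(objective='reg:squarederror', n_estimators=20),
)
pipeline.fit(trainX, trainy)
# at prediction time, NaN entries in the input are filled in by the imputer
yhat = pipeline.predict(x_input)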
I tried just filling x_input's other dimensions with None, but that yields terrible prediction results.
You should represent missing values as float("NaN") (not as None).
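For instance, a minimal sketch (the feature positions and values here are hypothetical, and model is the regressor trained above):
import numpy as np

# build a 1 x 19 row where only 4 feature values are known; the other 15 are
# marked as missing with NaN, which XGBoost handles natively
x_input = np.full((1, 19), np.nan)
x_input[0, [0, 5, 7, 12]] = [0.3, 1.2, -0.7, 4.5]

yhat = model.predict(x_input)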
If I understand your question correctly, you are trying to train a model with 19 features, but then feed it only 4 features to make a prediction.
That's not going to be possible. When you train a model, you are assuming that your data points are drawn from a probability distribution P(X, Y), where Y is your label and X represents your features. If you try to change the dimensionality of X, the points will no longer belong to that distribution (at least intuitively; I am not a mathematician, so I cannot offer a proof of this).
For instance, let's assume your data lies on a 3D cube. That means that you need three coordinate axes to represent a point on it. You cannot place a point using 2 dimensions without assuming the value of the remaining dimension.
You can assume the values of the features you try to drop, but they may not represent the data you originally trained on.
I've set up a neural network regression model using Keras with one target. This works fine,
now I'd like to include multiple targets. The dataset includes a total of 30 targets, and I'd rather train one neural network instead of 30 different ones.
My problem is that during preprocessing I have to remove some target values for a given example, as they represent unphysical values that should not be predicted.
This creates the issue that I have a varying number of targets/outputs.
For example:
Targets =
None, 0.007798, 0.012522
0.261140, 2110.000000, 2440.000000
0.048799, None, None
How would I go about creating a keras.Sequential model (or functional model) with a varying number of outputs for a given input?
edit: Could I perhaps first train a classification model that predicts the number of outputs given some test inputs, and then vary the number of outputs in the output layer according to this prediction? I guess I would have to use the functional API for something like that.
The "classification" edit here is unnecessary, i.e. ignore it. The number of outputs of the test targets is a known quantity.
(Sorry, I don't have enough reputation to comment)
First, do you know up front whether some of the output values will be invalid, or is predicting which outputs will actually be valid part of the problem?
If you don't know up front which outputs to disregard, you could go with something like the 2-step approach you described in your edit.
If it is deterministic (and you know how so) which outputs will be valid for any given input and your problem is just how to set up a proper model, here's how I would do that in keras:
Use the functional API
Create 30 named output layers (e.g. out_0, out_1, ... out_29)
When creating the model, just use the outputs argument to list all 30 outputs
When compiling the model, specify a loss for each separate output, you can do this by passing a dictionary to the loss argument where the keys are the names of your output layers and the values are the respective losses
Assuming you'll use mean-squared error for all outputs, the dictionary will look something like {'out_0': 'mse', 'out_1': 'mse', ..., 'out_29': 'mse'}
When fitting the model, pass three things: x, y, and loss-weights
y has to be a dictionary where the key is the output layer name and the value is the target output value
The loss-weights are also a dictionary in the same format as y. The weights in your case can just be binary, 1 for each output that corresponds to a real value, 0 for each output that corresponds to unphysical values (so they are disregarded during training) for any given sample
Don't pass Nones for the unphysical-value targets; use some kind of numeric filler, otherwise you'll get issues. It is completely irrelevant what you use as the filler, since its loss weight is 0 and it will not affect the gradients during training.
This will give you a trainable model. BUT once you move on from training and try to predict on new data, YOU will have to decide which outputs to disregard for each sample; the network will likely still give you "valid"-looking outputs for those inputs.
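Putting those steps together, here is a rough sketch of what that could look like (the layer sizes, input dimension and dummy data are assumptions, not taken from the question):
import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

n_features, n_outputs, n_samples = 10, 30, 1000   # hypothetical sizes

inp = Input(shape=(n_features,))
hidden = Dense(64, activation='relu')(inp)
outputs = [Dense(1, name=f'out_{i}')(hidden) for i in range(n_outputs)]

model = Model(inputs=inp, outputs=outputs)
model.compile(optimizer='adam',
              loss={f'out_{i}': 'mse' for i in range(n_outputs)})

# dummy data: y is a dict of per-output targets, weights is a dict of per-sample
# binary weights (0 masks an unphysical target for that sample)
x = np.random.rand(n_samples, n_features)
y = {f'out_{i}': np.random.rand(n_samples, 1) for i in range(n_outputs)}
weights = {f'out_{i}': np.random.randint(0, 2, n_samples).astype('float32')
           for i in range(n_outputs)}

# depending on your Keras version, sample_weight may need to be a list ordered
# like the outputs instead of a dict keyed by output names
model.fit(x, y, sample_weight=weights, batch_size=32, epochs=1)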
One possible solution would be to have a separate output of "validity flags" which takes values in the range from zero to one. For example, your first target would be
y=[0.0, 0.007798, 0.012522]
yf=[0.0, 1.0, 1.0]
where zeros indicate invalid values.
Use a sigmoid activation function for yf.
The loss function can be the sum of the losses for y and yf.
During inference, analyze the network output for yf and only consider a y value valid if the corresponding yf exceeds a 0.5 threshold.
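A rough sketch of that idea (the sizes and layer choices are assumptions, not from the question):
import numpy as np
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

n_features, n_targets = 10, 30   # hypothetical sizes

inp = Input(shape=(n_features,))
hidden = Dense(64, activation='relu')(inp)
y_out = Dense(n_targets, name='y')(hidden)                           # regression values
yf_out = Dense(n_targets, activation='sigmoid', name='yf')(hidden)   # validity flags

model = Model(inputs=inp, outputs=[y_out, yf_out])
# total loss is the sum of the regression loss and the flag loss
model.compile(optimizer='adam', loss={'y': 'mse', 'yf': 'binary_crossentropy'})

# during inference, keep only the y values whose flag exceeds 0.5
y_pred, yf_pred = model.predict(np.random.rand(1, n_features))
valid_values = y_pred[yf_pred > 0.5]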
I am wondering how to predict future time series values after training a model. I would like to get the values for the next N steps, and I am not sure whether the time series has been properly learned and predicted. How do I do this correctly to get the following (next) values? I want to get the next values using model.predict or something similar.
I have x_test, and x_test[-1] == t, so the next values are t+1, t+2, ..., t+n. In this example I want to get t+1, t+2, ..., t+n.
First
I tried using stock index data
inputs = total_data[len(total_data) - forecast - look_back:]
inputs = scaler.transform(inputs)

# build sliding windows of length look_back over the scaled inputs
X_test = []
for i in range(look_back, inputs.shape[0]):
    X_test.append(inputs[i - look_back:i])

X_test = np.array(X_test)
predicted = model.predict(X_test)
but the result is as shown below.
The results from X_test[-20:] and the following 20 predicted values look the same. I'm wondering if this is the correct way to train and predict, and whether the result is correct.
full source
The method I tried first did not work correctly.
Second
I realized something was wrong, so I tried another official dataset: I used the time series from the TensorFlow tutorial to practice training the model.
a = y_val[-look_back:]
for i in range(n_steps):  # predict a new value n times (n_steps = number of future steps)
    tmp = model.predict(a.reshape(-1, look_back, num_feature))  # predicted value
    a = a[1:]              # remove the first value
    a = np.append(a, tmp)  # append the predicted value
The results were predicted in a straight, linear-regression-like shape, very different from the real data.
The output is an abnormal straight line that is independent of the real data:
full source (After the 25th line is my code.)
I'm really curious how I can predict the following values of a time series using the TensorFlow predict method.
I'm not asking whether this works theoretically; I'm just wondering how to get the following n steps using the predict method.
Thank you for reading this long question. I would appreciate your advice.
In the second approach, the output is not as expected because of a small mistake in the code, as per my understanding.
The line of code,
a = y_val[-look_back:]
should be replaced by
look_back = 20
x = x_val_uni
a = x[-look_back:]
a.shape
In other words, we should send the X values as inputs to the model for the prediction, not the Y values.
However, we can compare its predictions with the Y values using the code:
y = y_val_uni[-20:]
plt.plot(y)
plt.plot(tmp)
plt.show()
Which would result in the plot shown below:
Please find the Complete Working Code in this Google Colab Gist.
The data provided contains a dataset of written numbers (45000 rows, 784 columns), a dataset of spoken numbers (45000 rows, 1 column), and a dataset containing labels with values TRUE or FALSE (45000 rows, boolean). The label value is TRUE when the spoken and written values correspond to the same number. Therefore, I need to create a neural network that predicts whether the outcome would be TRUE or FALSE.
Firstly, I summarized the data (min, max, mean, std dev) and applied a Perceptron but the error rate was too high. So now I'm trying to create a neural network to apply to this problem. But I'm getting stuck.
First I reshaped the data to get it to work:
written_train = written_train.reshape(45000,28,28)
spoken_train = spoken_train.reshape(45000,1,1)
After that I wanted to apply this model to the data:
from tensorflow.keras.layers import Input, LSTM, Dense, concatenate
from tensorflow.keras.models import Model

def model(i, p, data_a, data_b, labels):
    x = Input(shape=(28, 28))
    y = Input(shape=(1, 1))
    admi = LSTM(40, return_sequences=False)(x)
    pla = LSTM(40, return_sequences=False)(y)
    out = concatenate([admi, pla], axis=-1)
    print(pla)
    print(out)
    output = Dense(1, activation='sigmoid')(out)
    model = Model(inputs=[x, y], outputs=output)
    model.compile(optimizer='rmsprop',
                  loss='binary_crossentropy',
                  metrics=['accuracy'])
    n = model.fit([data_a, data_b], labels, batch_size=1, epochs=10)
    return n

re = model(1, 4, written_train, spoken_train, match_train)
print(re)
This results in the following error message:
ValueError: setting an array element with a sequence.
So I do think some pre-processing is needed, but I don't know how or what to do.
Also, this is my first time on Stackoverflow. So if there's anything I need to improve, please let me know :)
I have a dataset with 100k rows, which are pairs of store-item numbers (e.g. (store 1, item 190)), and 300 columns, which are a series of dates (e.g. 2017-01-01, 2017-01-02, 2017-01-03, ...). The values are the sales.
I am trying to use a Keras LSTM to predict future sales; how can I construct my train and validation datasets?
I am thinking of splitting train and validation like data[:, :n_days] and data[:, n_days:], so I will have the same number of samples (100k) in both my train and validation datasets. Am I thinking about this wrong?
If this is the way, how should I define n_days? Should the train and validation datasets have exactly the same dimensions? (Something like n_days = 150, with 149 days used to predict 1 day.)
how can I construct my train and validation datasets?
Not sure if there is a rule of thumb, but a common approach is to split your dataset into a ~80% training set and ~20% validation set; in your case this would be approximately 80k and 20k rows. The actual percentages may vary, but that ratio is the one most sources recommend. Ideally, you would also want a test dataset, one that you never used during training or validation, to evaluate the final performance of your models.
Now, regarding the shape of your data, it is important to recall what the keras docs on Recurrent Layers say:
Input shape
3D tensor with shape (batch_size, timesteps, input_dim).
Defining this shape would depend on the nature of your problem. You mention you want to predict sales, so this can be phrased as a Regression Problem. You also mention your data consists of 300 columns that make up your time series, and naturally you have the real sales value for each of those rows.
In this case, using a batch size of 1, it seems your shape will be (1, 300, 1). This means you are training on batches of 1 element (the most granular gradient update), where each element has 300 time steps and 1 feature or dimension at each time step.
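For instance, here is a rough sketch of getting a 100k x 300 sales array into that 3D layout, assuming the array is called data and the last day is used as the prediction target:
import numpy as np

data = np.random.rand(100000, 300)   # placeholder for the real sales values

# use the first 299 days as the input sequence and the last day as the target
X = data[:, :-1]                     # shape (100000, 299)
y = data[:, -1]                      # shape (100000,)

# add the trailing feature axis expected by Keras recurrent layers:
# (batch_size, timesteps, input_dim) -> here (100000, 299, 1)
X = X.reshape((X.shape[0], X.shape[1], 1))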
For splitting your data, one option that has helped me before is the train_test_split method from Sklearn, where you simply pass your data and labels and indicate the split ratio you want:
from sklearn.model_selection import train_test_split
#Split your data to have 15% validation split
X, X_val, Y, Y_val = train_test_split(data, labels, test_size=0.15)