I'm new to ML and would be grateful for any assistance. I trained a linear regression model on training set A and evaluated it on test set A. I saved the model and would now like to use it to predict the test set A target using features from test set B. Each time I run the model it throws the error below.
How can I successfully predict a test data set from features and a target with different shapes?
Input
print(testB.shape)
print(testA.shape)
Output
(2480, 5)
(1315, 6)
Input
saved_model = joblib.load(filename)
testB_result = saved_model.score(testB_features, testA_target)
print(testB_result)
Output
ValueError: Found input variables with inconsistent numbers of samples: [1315, 2480]
Thanks again
They are inconsistent shapes, which is why the error is being thrown. Have you tried reshaping the data so they have the same number of samples? From a quick look, testB has more samples and one fewer feature than testA.
Think about it: if you have trained your model with 5 features, you cannot then ask the same model to make a prediction given 6 features. You mention using a linear regressor; the equation is roughly:
y = b + w0*x0 + w1*x1 + w2*x2 + ... + w(N-1)*x(N-1)
Where {
y is your output/label
N is the number of features
b is the bias term
w(i) is the ith weight
x(i) is the ith feature value
}
You have trained a linear regressor with 5 features, effectively producing the following
y (your output/label) = b + w0*x0 + w1*x1 + w2*x2 + w3*x3 + w4*x4
You then ask it to make a prediction given 6 features but it only knows how to deal with 5.
Aside from that issue, you also have mismatched numbers of samples: testB has 2480 and testA has 1315. These need to match, as the model wants to make 2480 predictions but you only give it 1315 targets to compare against. How can you get a score for the 1165 samples that have no corresponding target? Do you now see why the data has to be reshaped?
EDIT
Assuming you now have datasets with an equal number of features as discussed above, you may look at reshaping (removing samples from) testB like so:
testB = testB[0:1315, :]
testB.shape
(1315, 5)
Or, if you would prefer a solution using the numpy API:
testB = np.delete(testB, np.s_[0:(len(testB)-len(testA))], axis=0)
testB.shape
(1315, 5)
Keep in mind, when doing this you slice out a number of samples. If those samples matter (which they may), then it may be better to introduce a pre-processing step to handle missing values, namely imputing them (for example with scikit-learn's SimpleImputer). It is worth noting that the data you are reshaping should be shuffled (unless it already is), as you may otherwise be removing parts of the data the model should be learning from. Neglecting to do this could result in a model that does not generalise as well as you hoped.
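If you go the imputation route, here is a minimal sketch using scikit-learn's SimpleImputer (a small hypothetical array, not your data):

import numpy as np
from sklearn.impute import SimpleImputer

# hypothetical feature matrix with a couple of missing entries
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [4.0, np.nan]])

# replace each missing value with the mean of its column
imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X)
print(X_filled)   # the NaNs become the column means (2.5 in each column)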
Related
I am using the xgboost XGBRegressor to train on data with 20 dimensions:
model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=20)
model.fit(trainX, trainy, verbose=False)
trainX is 2000 x 19, and trainy is 2000 x 1.
In other words, during training I am using the 19 dimensions of trainX to predict the 20th dimension (the single dimension of trainy).
When I am making a prediction:
yhat = model.predict(x_input)
x_input has to be 19 dimensions.
I am wondering if there is a way to keep using the 19 dimensions for training to predict the 20th dimension, but have x_input contain only 4 of those dimensions at prediction time. It is kind of like transfer learning to a different input dimension.
Does xgboost support such a feature? I tried just filling x_input's other dimensions with None, but that yields terrible prediction results.
Fundamentally, you're training your model with a dense dataset (19/19 feature values), and are now wondering if you're allowed to make predictions with a sparse dataset (4/19 feature values).
Does xgboost support such a feature?
Yes, it is technically possible with XGBoost, because XGBoost will treat the absent 15/19 feature values as missing. It would not be possible with some other ML frameworks (such as Scikit-Learn) that do not work with sparse input by default.
Alternatively, you can make your XGBoost model explicitly "missing-value-proof" by assembling a pipeline which contains feature imputation step(s).
I tried just filling x_input's other dimensions with None, but that yields terrible prediction results.
You should represent missing values as float("NaN") (not as None).
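A minimal sketch of that idea (hypothetical data, not the asker's): train on dense 19-feature rows, then predict a row where only 4 features are known and the remaining 15 are NaN, which XGBoost treats as missing by default:

import numpy as np
import xgboost as xgb

# hypothetical training data: 2000 samples, 19 features
rng = np.random.default_rng(0)
trainX = rng.normal(size=(2000, 19))
trainy = 2.0 * trainX[:, 0] + rng.normal(scale=0.1, size=2000)

model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=20)
model.fit(trainX, trainy, verbose=False)

# at prediction time only 4 of the 19 features are known;
# the unknown ones are filled with NaN so XGBoost treats them as missing
x_input = np.full((1, 19), np.nan)
x_input[0, :4] = [0.1, -0.3, 0.7, 1.2]
yhat = model.predict(x_input)
print(yhat)

Expect such predictions to still be worse than predictions made from fully observed rows, but far better than filling the unknown slots with arbitrary constants.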
If I understand your question correctly, you are trying to train a model with 19 features, but then feed it only 4 of them to make a prediction.
That's not going to be possible. When you train a model, you are assuming that your data points are drawn from a probability distribution P(X,Y), where Y is your label and X is your features. If you try to change the dimensionality of X, it'll no longer belong to that distribution (at least intuitively; I am not a mathematician, so I cannot come up with a proof for this).
For instance, let's assume your data lies on a 3D cube. That means that you need three coordinate axes to represent a point on it. You cannot place a point using 2 dimensions without assuming the value of the remaining dimension.
You can assume the values of the features you try to drop, but they may not represent the data you originally trained on.
I am running a logistic regression using 2 features, cylinders and year, for a multiclass classification problem.
After training, we apply the model to the test data:
for origin in unique_origins:
    # Select testing features.
    X_test = test[features]
    # Compute probability of observation being in the origin.
    testing_probs[origin] = models[origin].predict_proba(X_test)[:,1]
I am confused as to what the [:,1] part does in the code; could anyone kindly explain?
There is a hint which says:
testing_probs[1] should return the probability returned from model 1 for each observation.
However, I do not understand what it is trying to explain.
Please help and thank you.
array[:,1] keeps all the data along the first dimension but picks only the 2nd entry along the second dimension, i.e. the second column. (Note that for a NumPy array this is not the same as array[:][1], which returns the second row.) For example:
a = np.array([[0,1], [2,3]])
a[:,1]
>>> array([1, 3])
In your case, the first dimension indexes the predictions (one per observation) and the second dimension indexes the classes. So, when calling:
models[origin].predict_proba(X_test)[i,j]
You get the prediction for the ith input of the jth class.
So, models[origin].predict_proba(X_test)[:,1] means you want the predictions for all of your inputs but only the result for the 2nd class (as the first class is indexed at 0)!
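For illustration, a minimal self-contained sketch (toy binary data, not the course's dataset) showing the shape of predict_proba and what the column selection picks out:

import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical data: 6 observations, 2 features (say cylinders and year)
X = np.array([[4, 70], [8, 72], [6, 75], [4, 80], [8, 78], [6, 82]])
y = np.array([0, 1, 0, 0, 1, 1])

clf = LogisticRegression().fit(X, y)
probs = clf.predict_proba(X)
print(probs.shape)   # (6, 2): one row per observation, one column per class
print(probs[:, 1])   # probability of class 1 for every observation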
Problem definition
Dear community, I need your help implementing an LSTM neural network in Keras for a classification problem on panel data. The panel data I am manipulating consists of ids (let's call it id), a timestep for each id (t), n time-varying covariates and a binary outcome y. Each id contains a number of timesteps, and for each timestep I have my covariates and a unique outcome (0 or 1). I have reason to believe that each covariate for each id can have a certain degree of autocorrelation and hence can be considered a small time series of t steps. For simplicity, I assume that each id has a fixed number of t observations, with t not a big number (about 10 or so).
Data
Below is a toy example of what the data might look like in my case. In this example, the parameters are 2 individuals, 4 timesteps each, 4 covariates and each observation has a unique binary outcome. Covariates may be considered as (short) timeseries since they might be autocorrelated.
print(df)
[out]:
              A         B         C         D    y
id  t
id1 1  1.054127  0.346052  1.091299 -0.058137  0.0
    2  0.621390 -0.204682 -1.056786  0.864572  0.0
    3  1.275124  2.473959  0.264029 -1.047810  0.0
    4 -0.328441 -0.135891  0.148498  0.470876  1.0
id2 1  0.362969  0.777082  0.197423 -2.225296  0.0
    2  0.227134  0.086731  0.550267 -0.361482  0.0
    3  0.223526  0.556242 -0.160042  0.675871  1.0
    4  0.070125  0.156659 -2.922709 -1.143887  1.0
I have reason to assume that, for id1, the target at timestep 4 is conditional on the three previous timesteps for that same individual (id1). In addition, the target variable y may contain more than one value of 1 for each individual (as outlined in the case of id2 above). I do not have reason to believe that the data from one individual would affect the result of another (as with many behaviour analysis scenarios, since every individual is unique).
Prediction problem
What I would like to do is to predict a single outcome for a new individual for whom I have those 4 rows of observation. In other words, based on the historical data of an individual, I would like to know if said individual is likely to have an outcome 1 or 0. If I understand correctly, this can be achieved using an LSTM (alternatively, an RNN) with some data manipulation.
Things I have tried so far
To start simple, I tried aggregating every set of id rows into a single row with a single outcome, applied a typical statistical learning approach such as boosted trees, and got a model no better than random.
I looked into framing it as a survival analysis problem, in vain. Unlike tutorials on handling panel data in the medical field, I am not interested in estimating a survival function (nor would I have access to such data).
I have tried reshaping my data so that the input is a 3D array of the form [observations, timesteps, features], where observations are unique ids, for an LSTM, like so in Python:
# separate into features and target
df_feat = df.drop("y", axis = 1)
df_target = df[["y"]]
# get reshaped values for 3D tensor
n_samples = len(df_feat.index.get_level_values('id').unique().tolist())
n_timesteps = 4
n_features = df_feat.shape[1]
# reshape input array to be 3D
X_3D = df_feat.to_numpy().reshape(n_samples, n_timesteps, n_features)
print(X_3D.shape)
[out]:
(2, 4, 4)
However, at this point I get confused as to what my learning instances for the LSTM are and what the outcome y should be shaped like. I have tried having a shape like one outcome per training instance by taking only the last observation for each id (so y=[1,1] and y.shape = (2,) in the toy example above) which technically makes an LSTM script run... but does not capture prior information. Below is the code for such LSTM:
# imports assumed (tf.keras)
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

def train_lstm(X_train, y_train, X_valid, y_valid, save_name='best_lstm.h5'):
    # starts a sequential model
    model = Sequential()
    # add first lstm hidden layer with 64 units and default keras params
    model.add(LSTM(64, input_shape=(X_train.shape[1], X_train.shape[2]), return_sequences=True))
    # add a second hidden lstm layer with 128 units and default keras params
    model.add(LSTM(128, return_sequences=True))
    # add one last hidden lstm layer
    model.add(LSTM(64))
    # add one dense layer with 2 units and a sigmoid activation function
    model.add(Dense(2, activation='sigmoid'))
    # define adam optimiser with learning rate
    opt = tf.keras.optimizers.Adam(learning_rate=0.01)
    # compile model with binary cross entropy as loss function and accuracy as metrics
    model.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
    # define early stopping and best model checkpoint parameters
    es = EarlyStopping(monitor='val_loss', mode='min', verbose=0, patience=20)
    mc = ModelCheckpoint(save_name, monitor='val_accuracy', mode='max', verbose=0, save_best_only=True)
    # train the model using fit method (target vector is one-hot encoded as required by keras)
    history = model.fit(X_train, tf.one_hot(y_train, depth=2),
                        validation_data=(X_valid, tf.one_hot(y_valid, depth=2)),
                        epochs=100, callbacks=[es, mc])
    return history
It runs and makes predictions the way I want them to (for one id's previous history, we can predict one outcome), but results in poor performance since it fails to capture outcomes prior to the last.
I have carefully read and followed this nicely written Medium article by Alexander Laskorunsky, which remotely resembles what I am trying to do: it slides a window of K-length frames to capture the prior outcomes (and not just the last, as I have done), which makes more sense. However, Alexander does not consider panel data but rather multivariate time series classification that uses n_timesteps to predict the target using all predictors and all rows, even if windows overlap (so not panel data).
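To make the windowing idea concrete, here is a rough sketch of my own (not Alexander's code) applied to the toy frame above; K is an arbitrary window length and each window's label is the outcome at its last timestep:

import numpy as np

K = 2  # hypothetical window length
X_windows, y_windows = [], []
for _, group in df.groupby(level='id'):
    feats = group.drop('y', axis=1).to_numpy()
    labels = group['y'].to_numpy()
    for end in range(K, len(group) + 1):
        X_windows.append(feats[end - K:end])   # K consecutive timesteps of covariates
        y_windows.append(labels[end - 1])      # outcome at the window's last timestep
X_windows = np.array(X_windows)   # shape (n_windows, K, n_features)
y_windows = np.array(y_windows)   # shape (n_windows,)
print(X_windows.shape, y_windows.shape)   # (6, 2, 4) (6,) for the toy data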
Questions
Am I right to believe that I need a many-to-one LSTM architecture?
How may I divide and reshape training and testing samples such that a new, previously unseen individual which would not be related in any way to other ids can be classified?
Should each id be considered one sample / training instance? Should each id be split into training and testing sets, with all training and testing sets then concatenated to feed to an LSTM architecture?
Would you be so kind as to provide code snippets on how to correctly split and reshape my data as well as a simple LSTM architecture using keras (or maybe modify my own function above in case I coded it wrong)? No need for basic preprocessing and encoding variables.
Any help or advice / tutorials / articles regarding what architecture is most suitable for that kind of problem is greatly appreciated and thank you in advance for your help!
I'm confused how to interpret the output of .predict from a fitted CoxnetSurvivalAnalysis model in scikit-survival. I've read through the notebook Intro to Survival Analysis in scikit-survival and the API reference, but can't find an explanation. Below is a minimal example of what leads to my confusion:
import pandas as pd
from sksurv.datasets import load_veterans_lung_cancer
from sksurv.linear_model import CoxnetSurvivalAnalysis
# load data
data_X, data_y = load_veterans_lung_cancer()
# one-hot-encode categorical columns in X
categorical_cols = ['Celltype', 'Prior_therapy', 'Treatment']
X = data_X.copy()
for c in categorical_cols:
    dummy_matrix = pd.get_dummies(X[c], prefix=c, drop_first=False)
    X = pd.concat([X, dummy_matrix], axis=1).drop(c, axis=1)
# display final X to fit Cox Elastic Net model on
del data_X
print(X.head(3))
so here's the X going into the model:
Age_in_years Celltype Karnofsky_score Months_from_Diagnosis \
0 69.0 squamous 60.0 7.0
1 64.0 squamous 70.0 5.0
2 38.0 squamous 60.0 3.0
Prior_therapy Treatment
0 no standard
1 yes standard
2 no standard
...moving on to fitting model and generating predictions:
# Fit Model
coxnet = CoxnetSurvivalAnalysis()
coxnet.fit(X, data_y)
# What are these predictions?
preds = coxnet.predict(X)
preds has the same number of records as X, but its values are way different from the values in data_y, even when predicting on the same data the model was fit on.
print(preds.mean())
print(data_y['Survival_in_days'].mean())
output:
-0.044114643249153422
121.62773722627738
So what exactly are preds? Clearly .predict means something pretty different here than in scikit-learn, but I can't figure out what. The API Reference says it returns "The predicted decision function," but what does that mean? And how do I get to the predicted estimate in months yhat for a given X? I'm new to survival analysis so I'm obviously missing something.
I posted this question on github, though the author renamed the issue question.
I got some helpful explanation of what the predict output is, but still am not sure how to get to a set of predicted survival times, which is what I really want. Here's a couple helpful explanations from that github thread:
predictions are risk scores on an arbitrary scale, which means you can
usually only determine the sequence of events, but not their exact time.
-sebp (library author)
It [predict] returns a type of risk score. Higher value means higher
risk of your event (class value = True)...You were probably looking
for a predicted time. You can get the predicted survival function with
estimator.predict_survival_function as in the example 00
notebook...EDIT: Actually, I’m trying to extract this but it’s been a
bit of a pain to munge
-pavopax.
There's more explanation at the github thread, though I wasn't really able to follow all of it. I need to play around with predict_survival_function and predict_cumulative_hazard_function and see if I can get to a set of predictions for most likely survival time by row in X, which is what I really want.
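As a starting point in that direction, here is a rough sketch of my own (unverified; it assumes a scikit-survival version where CoxnetSurvivalAnalysis accepts fit_baseline_model=True, which is required before predict_survival_function can be called):

# refit with a baseline model so per-sample survival curves can be evaluated
coxnet_bl = CoxnetSurvivalAnalysis(fit_baseline_model=True)
coxnet_bl.fit(X, data_y)

# step functions S(t) for the first few rows of X
surv_funcs = coxnet_bl.predict_survival_function(X.iloc[:3])
for fn in surv_funcs:
    # crude point estimate: earliest time at which the curve is at or below 0.5
    below = fn.x[fn.y <= 0.5]
    print(below[0] if len(below) else "median survival not reached")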
I'm not going to accept this answer here, in case anyone else has a better one.
With the X input, predict simply evaluates the fitted linear predictor on the input array:
def predict(self, X, alpha=None):
    """The linear predictor of the model.

    Parameters
    ----------
    X : array-like, shape = (n_samples, n_features)
        Test data of which to calculate log-likelihood from

    alpha : float, optional
        Constant that multiplies the penalty terms. If the same alpha was used during training, exact
        coefficients are used, otherwise coefficients are interpolated from the closest alpha values that
        were used during training. If set to ``None``, the last alpha in the solution path is used.

    Returns
    -------
    T : array, shape = (n_samples,)
        The predicted decision function
    """
    X = check_array(X)
    coef = self._get_coef(alpha)
    return numpy.dot(X, coef)
The definition of check_array comes from another library (scikit-learn).
You can review the code of coxnet.
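So the returned value is just the dot product of your feature matrix with the fitted coefficients, i.e. a relative risk score rather than a survival time. As a rough sanity check (my own sketch; it assumes coef_ holds one column of coefficients per alpha, with the last alpha being the default used by predict):

import numpy as np
manual = np.dot(X.values, coxnet.coef_[:, -1])
print(np.allclose(manual, coxnet.predict(X)))   # expected True under that assumption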
(I am testing my abilities to write short but effective questions so let me know how I do here)
I am trying to train/test a TensorFlow recurrent neural network, specifically an LSTM, with some trials of time-series data in the following ndarray format:
[[[time_step_trial_0, feature, feature, ...]
[time_step_trial_0, feature, feature, ...]]
[[time_step_trial_1, feature, feature, ...]
[time_step_trial_1, feature, feature, ...]]
[[time_step_trial_2, feature, feature, ...]
[time_step_trial_2, feature, feature, ...]]]
The 1d portion of this 3d array holds a time step and all feature values that were observed at that time step. The 2d block contains all 1d arrays (time steps) that were observed in one trial. The 3d block contains all 2d blocks (trials) recorded for the time-series dataset. For each trial, the time step frequency is constant and the window interval is the same across all trials (0 to 50 seconds, 0 to 50 seconds, etc.).
For example, I am given data for Formula 1 race cars such as torque, speed, acceleration, rotational velocity, etc. Over a certain time interval, recording time steps every 0.5 seconds, I form 1d arrays with each time step versus the features recorded at that time step. Then I form a 2D array around all time steps corresponding to one Formula 1 race car's run on the track. I create a final 3D array holding all F1 cars and their time-series data. I want to train and test a model to detect anomalies in the F1 common trajectories on the course for new cars.
I am currently aware that the TensorFlow models support 2d arrays for training and testing. I was wondering what procedures I would have to go through in order to be able to train and test the model on all of the independent trials (2d) contained in this 3d array. In addition, I will be adding more trials in the future. So what are the proper procedures to go through in order to keep updating my model with the new data/trials to strengthen my LSTM?
Here is the model I was trying to initially replicate for a different purpose other than human activity: https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition. Another more feasible model would be this one, which I would much rather look at for anomaly detection in the time-series data: https://arxiv.org/abs/1607.00148. I want to build an anomaly detection model such that, given a set of non-anomalous time-series training data, we can detect anomalies in the test data where parts of the data over time are defined as "out of family."
I think for most LSTMs you're going to want to think of your data in this way (as it will be easy to use as input for the networks).
You'll have 3 dimension measurements:
feature_size = the number of different features (torque, velocity, etc.)
number_of_time_steps = the number of time steps collected for a single car
number_of_cars = the number of cars
It will most likely be easiest to read your data in as a set of matrices, where each matrix corresponds to one full sample (all the time steps for a single car).
You can arrange these matrices so that each row is an observation and each column is a different parameter (or the opposite, you may have to transpose the matrices, look at how your network input is formatted).
So each matrix is of size:
number_of_time_steps x feature_size (#rows x #columns). You will have number_of_cars different matrices. Each matrix is a sample.
To convert your array to this format, you can use this block of code (note, you can already access a single sample in your array with A[n], but this makes it so the shape of the accessed elements is what you expect):
import numpy as np
A = [[['car1', 'timefeatures1'], ['car1', 'timefeatures2']],
     [['car2', 'timefeatures1'], ['car2', 'timefeatures2']],
     [['car3', 'timefeatures1'], ['car3', 'timefeatures2']]]
easy_format = np.array(A)
Now you can get an individual sample with easy_format[n], where n is the sample you want.
easy_format[1] prints
array([['car2', 'timefeatures1'],
       ['car2', 'timefeatures2']], dtype='<U13')
easy_format[1].shape = (2,2)
Now that you can do that, you can format them however you need for the network you're using (transposing rows and columns if necessary, presenting a single sample at a time or all of them at once, etc.)
What you're looking to do (if I'm reading that second paper correctly) most likely requires a sequence to sequence lstm or rnn. Your original sequence is your time series for a given trial, and you're generating an intermediate set of weights (an embedding) that can recreate that original sequence with a low amount of error. You're doing this for all the trials. You will train this lstm on a series of reasonably normal trials and get it to perform well (reconstruct the sequence accurately). You can then use this same set of embeddings to try to reconstruct a new sequence, and if it has a high reconstruction error, you can assume it's anomalous.
Check this repo for a sample of what you'd want along with explanations of how to use it and what the code is doing (it only maps a sequence of integers to another sequence of integers, but can easily be extended to map a sequence of vectors to a sequence of vectors): https://github.com/ichuang/tflearn_seq2seq The pattern you'd define is just your original sequence. You might also take a look at autoencoders for this problem.
Final Edit: Check this repository: https://github.com/beld/Tensorflow-seq2seq-autoencoder/blob/master/simple_seq2seq_autoencoder.py
I have modified the code in it very slightly to work on the newest version of tensorflow and to make some of the variable names clearer. You should be able to modify it to run on your dataset. Right now I'm just having it autoencode a randomly generated array of 1's and 0's. You would do this for a large subset of your data and then see if other data was reconstructed accurately or not (much higher error than average might imply an anomaly).
import numpy as np
import tensorflow as tf

learning_rate = 0.001
training_epochs = 30000
display_step = 100
hidden_state_size = 100
samples = 10
time_steps = 20
step_dims = 5

test_data = np.random.choice([0, 1], size=(time_steps, samples, step_dims))

initializer = tf.random_uniform_initializer(-1, 1)

seq_input = tf.placeholder(tf.float32, [time_steps, samples, step_dims])
encoder_inputs = [tf.reshape(seq_input, [-1, step_dims])]
decoder_inputs = ([tf.zeros_like(encoder_inputs[0], name="GO")]
                  + encoder_inputs[:-1])
targets = encoder_inputs
weights = [tf.ones_like(targets_t, dtype=tf.float32) for targets_t in targets]

cell = tf.contrib.rnn.BasicLSTMCell(hidden_state_size)
_, enc_state = tf.contrib.rnn.static_rnn(cell, encoder_inputs, dtype=tf.float32)
cell = tf.contrib.rnn.OutputProjectionWrapper(cell, step_dims)
dec_outputs, dec_state = tf.contrib.legacy_seq2seq.rnn_decoder(decoder_inputs, enc_state, cell)

y_true = [tf.reshape(encoder_input, [-1]) for encoder_input in encoder_inputs]
y_pred = [tf.reshape(dec_output, [-1]) for dec_output in dec_outputs]

loss = 0
for i in range(len(y_true)):
    loss += tf.reduce_sum(tf.square(tf.subtract(y_pred[i], y_true[i])))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)

init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    x = test_data
    for epoch in range(training_epochs):
        #x = np.arange(time_steps * samples * step_dims)
        #x = x.reshape((time_steps, samples, step_dims))
        feed = {seq_input: x}
        _, cost_value = sess.run([optimizer, loss], feed_dict=feed)
        if epoch % display_step == 0:
            print("logits")
            a = sess.run(y_pred, feed_dict=feed)
            print(a)
            print("labels")
            b = sess.run(y_true, feed_dict=feed)
            print(b)
            print("Epoch:", '%04d' % (epoch + 1), "cost=", "{:.9f}".format(cost_value))
    print("Optimization Finished!")
Your input shape and the corresponding model depend on what type of anomaly you want to detect. You can consider:
1. Feature only Anomaly:
Here you consider individual feature vectors and decide whether any of them is anomalous, without considering when they were measured. In your example, the feature vector [torque, speed, acceleration, ...] is an anomaly if one or more of its values is an outlier with respect to the other features. In this case your inputs should be of the form [batch, features].
2. Time-feature Anomaly:
Here your inputs depend on when you measure the features. The current feature may depend on previous features measured over time. For example, there may be a feature whose value is an outlier if it appears at time 0 but not an outlier if it appears further in time. In this case you divide each of your trials into overlapping time windows and form a feature set of the form [batch, time_window, features].
It should be very simple to start with (1) using an autoencoder: train the autoencoder, and on the error between input and output choose a threshold, such as 2 standard deviations from the mean, to determine whether a sample is an outlier or not.
For (2), you can follow the second paper you mentioned, using a seq2seq model where the decoder error determines which features are outliers. You can check this for an implementation of such a model.
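For (1), a minimal sketch of the thresholding step (hypothetical reconstruction-error values; it assumes the per-sample autoencoder errors have already been computed):

import numpy as np

# hypothetical per-sample reconstruction errors from the autoencoder
errors = np.array([0.10, 0.12, 0.09, 0.11, 0.95, 0.10])

# flag samples whose error exceeds mean + 2 standard deviations
threshold = errors.mean() + 2 * errors.std()
anomalies = np.where(errors > threshold)[0]
print(anomalies)   # -> [4], the suspected anomaly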