Autoencoder Custom Error Metric not working as intended

Autoencoder Custom Error Metric not working as intended - python

I have created a custom error metric to measure the success of an autoencoder implemented in Keras and sklearn. At a high level, it computes the loss for each of the output neurons separately (num_features = num_input_neurons = num_output_neurons), and returns the highest and lowest loss of the features over the whole testing dataset, as well as the names of the two features themselves.
The implementation of the custom loss functions look like this:
import keras.backend as K
def get_argmax_max(features, diff_list):
return [features[K.argmax(diff_list)], K.max(diff_list).numpy()]
def get_argmin_min(features, diff_list):
return [features[K.argmin(diff_list)], K.min(diff_list).numpy()]
# features = [list of feature names]
decoded_prediction = test_autoencoder.predict(testing_data)
individual_diff_list = list(abs(x-y) for x, y in zip(decoded_prediction, testing_data))
error_list = np.mean(individual_diff_list, axis=0)
max_err = get_argmax_max(features, error_list)
min_err = get_argmin_min(features, error_list)
I have added two columns to train and test - one completely random, and one uniform (eg: all 1s) - in order to test the success of this custom metric. My hypothesis is that the custom metric will identify the uniform column as having the minimum error (as it is very simple to learn) and the random column as having the maximum error (as it is not possible to learn well). In practice, however, this is not the case, and the metric often selects random other columns instead of the expected uniform and random columns. I have ensured that the range of the random column exceeds that of all columns in the dataset, tested with sufficiently large encoding dimension, and have also tried running for a large number of epochs to no avail.

Related

SKlearn prediction on test dataset with different shape from training dataset shape

I'm new to ML and would be grateful for any assistance provided. I've run a linear regression prediction using test set A and training set A. I saved the linear regression model and would now like to use the same model to predict a test set A target using features from test set B. Each time I run the model it throws up the error below
How can I successfully predict a test data set from features and a target with different shapes?
Input
print(testB.shape)
print(testA.shape)
Output
(2480, 5)
(1315, 6)
Input
saved_model = joblib.load(filename)
testB_result = saved_model.score(testB_features, testA_target)
print(testB_result)
Output
ValueError: Found input variables with inconsistent numbers of samples: [1315, 2480]
Thanks again

They are inconsistent shapes which is why the error is being thrown. Have you tried to reshape the data so one of them are same shape? From a quick look, it seems that you have more samples and one less feature in testA.
Think about it, if you have trained your model with 5 features you cannot then ask the same model to make a prediction given 6 features. You speak of using a Linear Regressor, the equation is roughly:
y = b + w0*x0 + w1*x1 + w2*x2 + .. + wN-1*xN-1
Where {
y is your output/label
N is the number of features
b is the bias term
w(i) is the ith weight
x(i) is the ith feature value
}
You have trained a linear regressor with 5 features, effectively producing the following
y (your output/label) = b + w0*x0 + w1*x1 + w2*x2 + w3*x3 + w4*x4
You then ask it to make a prediction given 6 features but it only knows how to deal with 5.
Aside from that issue, you also have too many samples, testB has 2480 and testA has 1315. These need to match, as the model wants to make 2480 predictions, but you only give it 1315 outputs to compare it to. How can you get a score for 1165 missing samples? Do you now see why the data has to be reshaped?
EDIT
Assuming you have datasets with an equal amount of features as discussed above, you may now look at reshaping (removing data) testB like so:
testB = testB[0:1314, :]
testB.shape
(1315, 5)
Or, if you would prefer a solution using the numpy API:
testB = np.delete(testB, np.s_[0:(len(testB)-len(testA))], axis=0)
testB.shape
(1315, 5)
Keep in mind, when doing this you slice out a number of samples. If this is important to you (which it can be) then it may be better to introduce a pre-processing step to help out with the missing values, namely imputing them like this. It is worth noting that the data you are reshaping should be shuffled (unless it is already), as you may be removing parts of the data the model should be learning about. Neglecting to do this could result in a model that may not generalise as well as you hoped.

Can Neural Network model use Weighted Mean (Sum) Squared Error as its loss function?

I am nooby in this field of study and probably this is a pretty silly question. I want to build a normal ANN, but I am not sure if I can use a weighted mean square error as the loss function.
If we are not treating each sample equally, I mean we care the prediction precision more for some of the categories of the samples more than the others, then we want to form a weighted loss function.
Lets say, we have a categorical feature ci, i is the index of the sample, and for simplicity, we assume that this feature takes binary value, either 0 or 1. So, we can form the loss function as
(ci + 1)(yi_hat - yi)^2
#and take the sum for all i
Are there going to be any problem with the back-propagation? I don't see any issue with calculating the gradient or updating the weights between layers.
And, if no issue, how can I program this loss function in Keras? Because it seems that the loss function only takes two parameters, y_true and y_pred, how can I plug in the vector c?

There is absolutely nothing wrong with that. Functions can declare the constants withing themselves or even take the constants from an outside scope:
import keras.backend as K
c = K.constant([c1,c2,c3,c4,...,cn])
def weighted_loss(y_true,y_pred):
loss = keras.losses.get('mse')
return c * loss(y_true,y_pred)
Exactly like yours:
def weighted_loss(y_true,y_pred):
weighted = (c+1)*K.square(y_true-y_pred)
return K.sum(weighted)

3darray training/testing TensorFlow RNN LSTM

(I am testing my abilities to write short but effective questions so let me know how I do here)
I am trying to train/test a TensorFlow recurrent neural network, specifically an LSTM, with some trials of time-series data in the following ndarray format:
[[[time_step_trial_0, feature, feature, ...]
[time_step_trial_0, feature, feature, ...]]
[[time_step_trial_1, feature, feature, ...]
[time_step_trial_1, feature, feature, ...]]
[[time_step_trial_2, feature, feature, ...]
[time_step_trial_2, feature, feature, ...]]]
The the 1d portion of this 3darray holds the a time step and all feature values that were observed at that time step. The 2d block contains all 1d arrays (time steps) that were observed in one trial. The 3d block contains all 2d blocks (trials) recorded for the time-series dataset. For each trial, the time step frequency is constant and the window interval is the same across all trials (0 to 50 seconds, 0 to 50 seconds, etc.).
For example, I am given data for Formula 1 race cars such as torque, speed, acceleration, rotational velocity, etc. Over a certain time interval recording time steps every 0.5 seconds, I form 1d arrays with each time step versus the recorded features recorded at that time step. Then I form a 2D array around all time steps corresponding to one Formula 1 race car's run on the track. I create a final 3D array holding all F1 cars and their time-series data. I want to train and test a model to detect anomalies in the F1 common trajectories on the course for new cars.
I am currently aware that the TensorFlow models support 2d arrays for training and testing. I was wondering what procedures I would have to go through in order the be able to train and test the model on all of the independent trials (2d) contained in this 3darray. In addition, I will be adding more trials in the future. So what are the proper procedures to go through in order to constantly be updating my model with the new data/trials to strengthen my LSTM.
Here is the model I was trying to initially replicate for a different purpose other than human activity: https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition. Another more feasible model would be this which I would much rather look at for anomaly detection in the time-series data: https://arxiv.org/abs/1607.00148. I want to build a anomaly detection model that given the set of non-anomalous time-series training data, we can detect anomalies in the test data where parts of the data over time is defined as "out of family."

I think for most LSTM's you're going to want to think of your data in this way (as it will be easy to use as input for the networks).
You'll have 3 dimension measurements:
feature_size = the number of different features (torque, velocity, etc.)
number_of_time_steps = the number of time steps collected for a single car
number_of_cars = the number of cars
It will most likely be easiest to read your data in as a set of matrices, where each matrix corresponds to one full sample (all the time steps for a single car).
You can arrange these matrices so that each row is an observation and each column is a different parameter (or the opposite, you may have to transpose the matrices, look at how your network input is formatted).
So each matrix is of size:
number_of_time_steps x feature_size (#rows x #columns). You will have number_of_cars different matrices. Each matrix is a sample.
To convert your array to this format, you can use this block of code (note, you can already access a single sample in your array with A[n], but this makes it so the shape of the accessed elements are what you expect):
import numpy as np
A = [[['car1', 'timefeatures1'],['car1', 'timefeatures2']],
[['car2', 'timefeatures1'],['car2', 'timefeatures2']],
[['car3', 'timefeatures1'],['car3', 'timefeatures2']]
]
easy_format = np.array(A)
Now you can get an individual sample with easy_format[n], where n is the sample you want.
easy_format[1] prints
array([['car2', 'timefeatures1'],
['car2', 'timefeatures2']],
dtype='|S12')
easy_format[1].shape = (2,2)
Now that you can do that, you can format them however you need for the network you're using (transposing rows and columns if necessary, presenting a single sample at a time or all of them at once, etc.)
What you're looking to do (if I'm reading that second paper correctly) most likely requires a sequence to sequence lstm or rnn. Your original sequence is your time series for a given trial, and you're generating an intermediate set of weights (an embedding) that can recreate that original sequence with a low amount of error. You're doing this for all the trials. You will train this lstm on a series of reasonably normal trials and get it to perform well (reconstruct the sequence accurately). You can then use this same set of embeddings to try to reconstruct a new sequence, and if it has a high reconstruction error, you can assume it's anomalous.
Check this repo for a sample of what you'd want along with explanations of how to use it and what the code is doing (it only maps a sequence of integers to another sequence of integers, but can easily be extended to map a sequence of vectors to a sequence of vectors): https://github.com/ichuang/tflearn_seq2seq The pattern you'd define is just your original sequence. You might also take a look at autoencoders for this problem.
Final Edit: Check this repository: https://github.com/beld/Tensorflow-seq2seq-autoencoder/blob/master/simple_seq2seq_autoencoder.py
I have modified the code in it very slightly to work on the newest version of tensorflow and to make some of the variable names clearer. You should be able to modify it to run on your dataset. Right now I'm just having it autoencode a randomly generated array of 1's and 0's. You would do this for a large subset of your data and then see if other data was reconstructed accurately or not (much higher error than average might imply an anomaly).
import numpy as np
import tensorflow as tf
learning_rate = 0.001
training_epochs = 30000
display_step = 100
hidden_state_size = 100
samples = 10
time_steps = 20
step_dims = 5
test_data = np.random.choice([ 0, 1], size=(time_steps, samples, step_dims))
initializer = tf.random_uniform_initializer(-1, 1)
seq_input = tf.placeholder(tf.float32, [time_steps, samples, step_dims])
encoder_inputs = [tf.reshape(seq_input, [-1, step_dims])]
decoder_inputs = ([tf.zeros_like(encoder_inputs[0], name="GO")]
+ encoder_inputs[:-1])
targets = encoder_inputs
weights = [tf.ones_like(targets_t, dtype=tf.float32) for targets_t in targets]
cell = tf.contrib.rnn.BasicLSTMCell(hidden_state_size)
_, enc_state = tf.contrib.rnn.static_rnn(cell, encoder_inputs, dtype=tf.float32)
cell = tf.contrib.rnn.OutputProjectionWrapper(cell, step_dims)
dec_outputs, dec_state = tf.contrib.legacy_seq2seq.rnn_decoder(decoder_inputs, enc_state, cell)
y_true = [tf.reshape(encoder_input, [-1]) for encoder_input in encoder_inputs]
y_pred = [tf.reshape(dec_output, [-1]) for dec_output in dec_outputs]
loss = 0
for i in range(len(y_true)):
loss += tf.reduce_sum(tf.square(tf.subtract(y_pred[i], y_true[i])))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)
init = tf.initialize_all_variables()
with tf.Session() as sess:
sess.run(init)
x = test_data
for epoch in range(training_epochs):
#x = np.arange(time_steps * samples * step_dims)
#x = x.reshape((time_steps, samples, step_dims))
feed = {seq_input: x}
_, cost_value = sess.run([optimizer, loss], feed_dict=feed)
if epoch % display_step == 0:
print "logits"
a = sess.run(y_pred, feed_dict=feed)
print a
print "labels"
b = sess.run(y_true, feed_dict=feed)
print b
print("Epoch:", '%04d' % (epoch+1), "cost=", "{:.9f}".format(cost_value))
print("Optimization Finished!")

Your input shape and the corresponding model depends on why type of Anomaly you want to detect. You can consider:
1. Feature only Anomaly:
Here you consider individual features and decide whether any of them is Anomalous, without considering when its measured. In your example,the feature [torque, speed, acceleration,...] is an anomaly if one or more is an outlier with respect to the other features. In this case your inputs should be of form [batch, features].
2. Time-feature Anomaly:
Here your inputs are dependent on when you measure the feature. Your current feature may depend on the previous features measured over time. For example there may be a feature whose value is an outlier if it appears at time 0 but not outlier if it appears furture in time. In this case you divide each of your trails with overlapping time windows and form a feature set of form [batch, time_window, features].
It should be very simple to start with (1) using an autoencoder where you train an auto-encoder and on the error between input and output, you can choose a threshold like 2-standard devations from the mean to determine whether its an outlier or not.
For (2), you can follow the second paper you mentioned using a seq2seq model, where your decoder error will determine which features are outliers. You can check on this for the implementation of such a model.

Custom Loss Function in TensorFlow for weighting training data

I want to weight the training data based on a column in the training data set. Thereby giving more importance to certain training items than others. The weighting column should not be included as a feature for the input layer.
The Tensorflow documentation holds an example how to use the label of the item to assign a custom loss and thereby assigning weight:
# Ensures that the loss for examples whose ground truth class is `3` is 5x
# higher than the loss for all other examples.
weight = tf.multiply(4, tf.cast(tf.equal(labels, 3), tf.float32)) + 1
onehot_labels = tf.one_hot(labels, num_classes=5)
tf.contrib.losses.softmax_cross_entropy(logits, onehot_labels, weight=weight)
I am using this in a custom DNN with three hidden layers. In theory i simply need to replace labels in the example above with a tensor containing the weight column.
I am aware that there are several threads that already discuss similar problems e.g. defined loss function in tensorflow?
For some reason i am running into a lot of problems trying to bring my weight column in. It's probably two easy lines of code or maybe there is an easier way to achieve the same result.

I believe i found the answer:
weight_tf = tf.range(features.get_shape()[0]-1, features.get_shape()[0])
loss = tf.losses.softmax_cross_entropy(target, logits, weights=weight_tf)
The weight is the last column in the features tensorflow.

how to set the number of features to use in random selection sklearn

I am using sklearn RandomForest Classifier/Bag classifier for learning and I am not getting the expected results when compared to Java/Weka Machine Learning library.
In Weka, I am learning the model with - Random forest of 10 trees, each constructed while considering 6 random features. (setNumFeatures need to be set and default is 10 trees)
In sklearn - I am not sure how to specify the number of features to randomly consider while constructing a random forest of 10 trees. This what I am doing:
rf_classifier = RandomForestClassifier(n_estimators=num_trees, max_features=6)
rf_classifier = rf_classifier.fit(train_file, train_file_label)
for items in rf_classifier.estimators_:
classifier_list.append(items)
I saw the docs and there is a parameter - max_features but I am not sure if that serves the purpose. I get this error when I am trying to calculate entropy:
# code to calculate voting entropy for all features (unlabeled data)
vote_count_for_features = list(classifier_list[0].predict(feature_data_arr))
for i in range(1, len(classifier_list)):
res_temp = []
res_temp = list(classifier_list[i].predict(feature_data_arr))
vote_count_for_features = [x + y for x, y in zip(vote_count_for_features, res_temp)]
If I set that parameter to 6, than my code fails with the error message:
Number of features of the model must match the input. Model n_features
is 6 and input n_features is 31
Inputs: Sample set of 1 million records with 31 features. When I run weka, the number of rules extracted are around 1000 whereas when I run the same thing through sklearn - I get hardly 70 rules.
I am new to python and sklearn and I am trying to figure out where am I doing wrong. (Weka code has been tested well and gives 95% precision, 80% recall - so I am assuming that's good)
Note: I have used sklearn imputer to impute missing values using 'mean' strategy whereas Weka has ways to handle NaN.
This is what I am trying to achieve: Learn Random Forest on a sample set, extract rules, evaluate rules and then apply on the bigger set
Any suggestions or input will really help me debug through the issue and solve it quickly.

I think the issue is that the individual trees get confused since they only use 6 features, but you give them 31. You can try to get the prediction to work by setting check_input = False:
list(classifier_list[i].predict(feature_data_arr, check_input = False))

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.