I am given an open-set insect classification problem using DNA barcodes. The goal is to predict species labels for test samples whose species is represented in the training set, and genus labels for test samples whose species is not represented in the training set. The given data variables look like this:
gtrain: This is a column vector of size 16128. This variable contains genus level labels for each insect instance in the training set. You can think of these as the parent nodes of the leaf nodes in a tree, where leaf nodes are the species and parent nodes are the genera. All instances with the same gtrain value share the same genus.
ytrain: This is a column vector of size 16128. This variable contains species level labels for each insect instance in the training set. All insect instances with the same ytrain value belong to the same species.
emb_train: This is a 2D matrix of size 16128x1000. Each row in this matrix is a high dimensional encoding (or embedding) of the corresponding nucleotide sequence in the training set.
emb_test: This is a 2D matrix of size 5989x1000. Each row in this matrix is a high dimensional encoding (or embedding) of the corresponding nucleotide sequence in the test set.
I can predict either genus or species labels using the code below, by passing either the gtrain or the ytrain variable:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(emb_train, gtrain, test_size=0.3)  # or ytrain in place of gtrain
classifier = RandomForestClassifier(n_estimators=5)
classifier.fit(xtrain, ytrain.ravel())
ypred = classifier.predict(emb_test)
But I think these predictions are inaccurate because, as stated above, I need to use both gtrain and ytrain to train my model in some way and then make accurate final predictions on emb_test. I have not been able to do this.
Can someone provide some guidance/resources/ideas on how to tackle a problem like this? I can provide more info if something is unclear about the problem.
If gtrain holds the parent labels for ytrain (IIUC, to visualize all the labels we could connect each genus node to its corresponding species child nodes, forming a depth-2 tree), we could learn to predict both the genus label and the species label at training time. If I were doing this, I would simply concatenate the label outputs of the genus label space and the species label space.
Let's assume your genus space is 100 (you have 100 unique genus categories) and your species space is 1000 (you have 1000 unique species across all genera).
Your gtrain has 16128 entries; it can be transformed into a 16128x100 matrix with one one-hot row per instance.
Your ytrain has 16128 entries; it can be transformed into a 16128x1000 matrix with one one-hot row per instance.
After concatenating along the label dimension, you have a label matrix of shape [16128, 1100].
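A minimal sketch of this label construction, assuming NumPy arrays named gtrain and ytrain as in the question:
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One-hot encode genus and species labels, then concatenate them column-wise.
genus_onehot = OneHotEncoder().fit_transform(gtrain.reshape(-1, 1)).toarray()    # (16128, n_genus)
species_onehot = OneHotEncoder().fit_transform(ytrain.reshape(-1, 1)).toarray()  # (16128, n_species)
y_concat = np.hstack([genus_onehot, species_onehot])                             # (16128, n_genus + n_species)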
You could build a model that takes the 1000-dimensional input embedding, passes it through a few hidden fully-connected layers, and finally connects to the 1100-dimensional output.
At training time, each step picks a small batch of examples (say 64 out of the 16128 total).
input: 64 x 1000 (batch size x embedding dimension)
output: 64 x 1100 (batch size x output label dimension)
Simply minimize the cross-entropy loss at the output.
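A minimal Keras sketch of such a network, using the y_concat matrix from above. Here the concatenated target is treated as a multi-label problem with sigmoid outputs and binary cross-entropy, which is one way to apply a cross-entropy loss to the 1100-dimensional output (two softmax heads would work as well); the layer sizes are placeholders:
from tensorflow import keras

n_genus, n_species = genus_onehot.shape[1], species_onehot.shape[1]     # e.g. 100 and 1000

model = keras.Sequential([
    keras.Input(shape=(1000,)),                                         # embedding dimension
    keras.layers.Dense(512, activation="relu"),
    keras.layers.Dense(256, activation="relu"),
    keras.layers.Dense(n_genus + n_species, activation="sigmoid"),      # 1100-dimensional output
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(emb_train, y_concat, batch_size=64, epochs=10, validation_split=0.1)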
At prediction time, you can use some heuristics. For example:
Base the decision on the confidence of the species output: if all logits from the species output nodes are low (the threshold value can be determined with a validation dataset), you probably should predict nothing at the species level and instead pick the top prediction from the genus logits (a sketch of this rule follows the list).
Consider the mutual agreement between the genus-level and species-level predictions. IIUC, if one genus label has a very high logit but all of its corresponding species logits are low (or vice versa), this can be treated as a "disagreement" that triggers the logic of predicting only the genus-level label and no species label.
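A rough sketch of the first rule, using the sigmoid outputs of the model above in place of raw logits; the 0.5 threshold is only a placeholder to be tuned on a validation split:
probs = model.predict(emb_test)                          # (n_test, n_genus + n_species)
genus_probs, species_probs = probs[:, :n_genus], probs[:, n_genus:]

species_threshold = 0.5                                  # tune on a held-out validation set
predictions = []
for i in range(len(probs)):
    if species_probs[i].max() >= species_threshold:
        predictions.append(("species", species_probs[i].argmax()))  # column index; map back via the encoder's categories_
    else:
        predictions.append(("genus", genus_probs[i].argmax()))      # fall back to the genus-level prediction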
Edit: I also looked at your code that uses a random forest. In that case, you could build two classifiers that use the same embedding features as input: one predicts the genus label and the other predicts the species label. At inference time, you run the two classifiers in parallel and get both genus-level and species-level predictions. Then you can use heuristics similar to the ones above to decide the final prediction.
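A hedged sketch of this two-classifier variant with scikit-learn, reusing the question's variable names (the n_estimators value is just a placeholder):
from sklearn.ensemble import RandomForestClassifier

genus_clf = RandomForestClassifier(n_estimators=200).fit(emb_train, gtrain.ravel())
species_clf = RandomForestClassifier(n_estimators=200).fit(emb_train, ytrain.ravel())

genus_proba = genus_clf.predict_proba(emb_test)      # per-class probabilities in the genus label space
species_proba = species_clf.predict_proba(emb_test)  # per-class probabilities in the species label space
# Apply the confidence/agreement heuristics above to decide, per test row,
# whether to report the species prediction or fall back to the genus prediction.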
Related
I am using the xgboost XGBRegressor to train on data with 20 dimensions:
import xgboost as xgb

model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=20)
model.fit(trainX, trainy, verbose=False)
trainX is 2000 x 19, and trainy is 2000 x 1.
In other words, during training I am using the 19 dimensions of trainX to predict the 20th dimension (the single dimension of trainy).
When I am making a prediction:
yhat = model.predict(x_input)
x_input has to be 19 dimensions.
I am wondering if there is a way to keep training on the 19 dimensions to predict the 20th dimension, but at prediction time have x_input contain only 4 dimensions with which to predict the 20th dimension. It is kind of like transfer learning to a different input dimensionality.
Does xgboost support such a feature? I tried just filling x_input's other dimensions with None, but that yields terrible prediction results.
Fundamentally, you're training your model with a dense dataset (19/19 feature values), and are now wondering if you're allowed to make predictions with a sparse dataset (4/19 feature values).
Does xgboost support such a feature?
Yes, it is technically possible with XGBoost, because XGBoost will treat the absent 15/19 feature values as missing. It would not be possible with some other ML frameworks (such as Scikit-Learn estimators) that do not accept missing input values by default.
Alternatively, you can make your XGBoost model explicitly "missing-value-proof" by assembling a pipeline which contains feature imputation step(s).
I tried just filling x_input's other dimensions with None, but that yields terrible prediction results.
You should represent missing values as float("NaN") (not as None).
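A small sketch under the question's setup (19 training features, 4 known at prediction time); the chosen feature indices and values are placeholders:
import numpy as np
import xgboost as xgb

model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=20)
model.fit(trainX, trainy)                      # trainX: (2000, 19), trainy: (2000, 1)

# Build a full-width input row and mark the 15 unknown features as missing with NaN.
x_input = np.full((1, 19), np.nan)
known_idx = [0, 1, 2, 3]                       # hypothetical indices of the 4 known features
x_input[0, known_idx] = [0.1, 2.3, -0.7, 5.0]  # hypothetical known values
yhat = model.predict(x_input)                  # NaN entries are routed down the trees' default branches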
If I understand your question correctly, you are trying to train a model with 19 features, but then feed it only 4 features to make a prediction.
That's not going to be possible. When you train a model, you are assuming that your data points are drawn from a probability distribution P(X,Y), where Y is your label and X is your feature vector. If you try to change the dimensionality of X, it will no longer belong to that distribution (at least intuitively; I am not a mathematician, so I cannot come up with a proof for this).
For instance, let's assume your data lies on a 3D cube. That means that you need three coordinate axes to represent a point on it. You cannot place a point using 2 dimensions without assuming the value of the remaining dimension.
You can assume the values of the features you try to drop, but they may not represent the data you originally trained on.
I am hoping to get some help here. I want to use k-means clustering to divide the training set for supervised learning in the next stage, which is called spatial parcellation in the literature.
I have extracted features (X_train) from 3D image data (within a dilated mask); the features include the x, y, and z positions of the voxels. The mask contains two labels (y_train = {0: background, 1: obj1}), meaning each voxel inside the mask is either background or object.
I used KMeans clustering from scikit-learn and clustered the training dataset's voxel positions ([x, y, z]) into 80 different clusters.
Problem: I have divided the training set into 80 clusters like this:
import numpy as np
from sklearn.cluster import KMeans

pos_ind = [0, 1, 2]   # columns holding the voxel positions [x, y, z]
kmeans_model = KMeans(n_clusters=80, random_state=rng).fit(X_TRAIN[:, pos_ind])
later I load the kmeans_model and assign the validation set to the clusters:
Dval_clusters = kmeans_model.predict(X_DVAL[:, pos_ind])
and then I find the row indices of the validation set that fall into the cidx-th cluster:
cluster_idx = np.unique(kmeans_model.labels_)
Dval_rows = np.where(Dval_clusters == cluster_idx[cidx])[0]  # rows of X_DVAL that belong to the cidx-th cluster
X_dval = X_DVAL[Dval_rows]
y_dval = y_DVAL[Dval_rows]
Once I fit the validation set on the trained model, an error appears even though the trained model may have seen both labels 0 and 1. This means that when the validation samples were assigned to the clusters, some cluster picked up only background voxels or only object voxels (i.e., all samples in that cluster's validation subset have a single class label).
ValueError: could not broadcast input array from shape (30527,1) into shape (30527,2)
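As a quick diagnostic of the situation described above (a sketch reusing the names from the snippets here), you can count how many voxels of each class land in every cluster of the validation set; a cluster with a zero count for one of the classes is the one that breaks the downstream step:
import numpy as np

Dval_clusters = kmeans_model.predict(X_DVAL[:, pos_ind])
for c in np.unique(kmeans_model.labels_):
    labels_in_c = y_DVAL[Dval_clusters == c]
    counts = np.bincount(labels_in_c.ravel().astype(int), minlength=2)
    print(c, counts)   # [n_background, n_obj1] for this cluster's validation subset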
Question: I know that clustering does not use labels, but is it possible to enforce that both labels are represented in every cluster? Or is there any trick for doing so? Background is just a class label, and we need to separate the object class (class 1) from the background (class 0).
I would really appreciate if you leave your expert opinion here.
I'm trying to run an SVM RBF regression on my train and test datasets.
from sklearn.svm import SVC

svm = SVC(kernel='rbf', random_state=0, C=C, gamma=0.9)
svm.fit(NewX, NewY)
The training step works without any problem. However, the prediction step (svm.predict) gives me this error:
"ValueError: all the input array dimensions except for the
concatenation axis must match exactly"
Call to the prediction method:
Z = svm.predict(np.c_[NX_Test.ravel(), NY_Test.ravel()])
Z = Z.reshape(NX_Test.shape)
Data Format:
My training dataset is a list of 80 input examples, where each example is a signal of 100 samples.
My testing dataset is a list of 20 input examples, where each example is also a signal consisting of 100 samples.
https://pythonspot.com/support-vector-machine/
Did you check if the dimensions of all your training samples match?
An SVM needs the samples, the feature vectors, to have the same dimension.
Consider the following feature vector in libSVM format:
1:0.2 2:0.4 3:1.0 4:0.07 5:0.3
In each pair, the first value is the feature index and the second is the associated value. This vector has a dimension of 5, so all your other feature vectors must match this dimension for training. After training, the vectors you want to predict on must also match this dimension exactly. So, verify that this constraint is satisfied.
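A quick way to check this, reusing the question's variable names (the expected shapes are assumptions based on the description above):
import numpy as np

X_train = np.asarray(NewX)                          # expected (80, 100): 80 signals of 100 samples each
X_pred = np.c_[NX_Test.ravel(), NY_Test.ravel()]    # note: np.c_ of two 1-D arrays yields only 2 columns
print(X_train.shape, X_pred.shape)
assert X_pred.shape[1] == X_train.shape[1], "prediction input must have the same number of features as the training data"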
I have a dataset with 100k rows, which are store-item pairs (e.g. (store 1, item 190)), and 300 columns, which are a series of dates (e.g. 2017-01-01, 2017-01-02, 2017-01-03, ...). The values are the sales.
I am trying to use a Keras LSTM to predict future sales; how should I construct my train and validation datasets?
I am thinking of splitting train and validation like data[:, :n_days] and data[:, n_days:], so I would have the same number of samples (100k) in both my train and validation datasets. Am I thinking about this the wrong way?
If this is the way to do it, how should I define n_days? Should the train and validation datasets have exactly the same dimensions? (Something like n_days = 150, with 149 days used to predict 1 day.)
how should I construct my train and validation datasets?
Not sure if a rule of thumb, but a common approach is to split your dataset into a ~80% training set and ~20% validation set; in your case this would be approximately 80k and 20k. The actual percentages may vary, but that ratio is the one most sources recommend. Ideally you would also want to have a test dataset, one that you never used during training or validation, to evaluate the final performance of your models.
Now, regarding the shape of your data, it is important to recall what the Keras docs on Recurrent Layers say:
Input shape
3D tensor with shape (batch_size, timesteps, input_dim).
Defining this shape would depend on the nature of your problem. You mention you want to predict sales, so this can be phrased as a Regression Problem. You also mention your data consists of 300 columns that make up your time series, and naturally you have the real sales value for each of those rows.
In this case, using a batch size of 1, your shape would be (1, 300, 1), which means you are training on batches of 1 element (the most granular gradient update), where each element has 300 time steps and 1 feature (dimension) per time step.
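A hedged sketch of how the (100000, 300) sales matrix could be reshaped into that 3D tensor, using the n_days idea from the question (the variable names and the choice of 299 input days are only illustrative):
import numpy as np

data = np.random.rand(100000, 300)              # stand-in for the real sales matrix (rows = store-item pairs)
n_days = 299                                    # use the first 299 days to predict day 300
X = data[:, :n_days].reshape(-1, n_days, 1)     # (batch_size, timesteps, input_dim) = (100000, 299, 1)
y = data[:, n_days]                             # (100000,) target: sales on the last day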
For splitting your data one option you can use that has helped me before is the train_test_split method from Sklearn, where you simply pass your data and labels and indicate the ratio you want:
from sklearn.model_selection import train_test_split
# Split your data to have a 15% validation split
X, X_val, Y, Y_val = train_test_split(data, labels, test_size=0.15)
(I am testing my abilities to write short but effective questions so let me know how I do here)
I am trying to train/test a TensorFlow recurrent neural network, specifically an LSTM, with some trials of time-series data in the following ndarray format:
[[[time_step_trial_0, feature, feature, ...]
[time_step_trial_0, feature, feature, ...]]
[[time_step_trial_1, feature, feature, ...]
[time_step_trial_1, feature, feature, ...]]
[[time_step_trial_2, feature, feature, ...]
[time_step_trial_2, feature, feature, ...]]]
The 1D portion of this 3D array holds a time step and all feature values that were observed at that time step. The 2D blocks contain all 1D arrays (time steps) that were observed in one trial. The 3D block contains all 2D blocks (trials) recorded for the time-series dataset. For each trial, the time step frequency is constant and the window interval is the same across all trials (0 to 50 seconds, 0 to 50 seconds, etc.).
For example, I am given data for Formula 1 race cars such as torque, speed, acceleration, rotational velocity, etc. Over a certain time interval, recording time steps every 0.5 seconds, I form 1D arrays pairing each time step with the features recorded at that time step. Then I form a 2D array from all the time steps corresponding to one Formula 1 car's run on the track. Finally, I create a 3D array holding all F1 cars and their time-series data. I want to train and test a model that detects anomalies in the common F1 trajectories on the course for new cars.
I am aware that TensorFlow models support 2D arrays for training and testing. I was wondering what procedures I would have to go through to be able to train and test the model on all of the independent trials (2D blocks) contained in this 3D array. In addition, I will be adding more trials in the future, so what is the proper procedure for continually updating my model with new data/trials to strengthen my LSTM?
Here is the model I initially tried to replicate, for a purpose other than human activity recognition: https://github.com/guillaume-chevalier/LSTM-Human-Activity-Recognition. Another, more suitable model, which I would much rather look at for anomaly detection in time-series data, is this: https://arxiv.org/abs/1607.00148. I want to build an anomaly detection model that, given a set of non-anomalous time-series training data, can detect anomalies in the test data where parts of the data over time are defined as "out of family."
I think for most LSTMs you're going to want to think of your data in this way (as it will be easy to use as input for the networks).
You'll have 3 dimensions:
feature_size = the number of different features (torque, velocity, etc.)
number_of_time_steps = the number of time steps collected for a single car
number_of_cars = the number of cars
It will most likely be easiest to read your data in as a set of matrices, where each matrix corresponds to one full sample (all the time steps for a single car).
You can arrange these matrices so that each row is an observation and each column is a different parameter (or the opposite, you may have to transpose the matrices, look at how your network input is formatted).
So each matrix is of size:
number_of_time_steps x feature_size (#rows x #columns). You will have number_of_cars different matrices. Each matrix is a sample.
To convert your array to this format, you can use this block of code (note, you can already access a single sample in your array with A[n], but this makes it so the shape of the accessed elements are what you expect):
import numpy as np
A = [[['car1', 'timefeatures1'], ['car1', 'timefeatures2']],
     [['car2', 'timefeatures1'], ['car2', 'timefeatures2']],
     [['car3', 'timefeatures1'], ['car3', 'timefeatures2']]]
easy_format = np.array(A)
Now you can get an individual sample with easy_format[n], where n is the sample you want.
easy_format[1] prints
array([['car2', 'timefeatures1'],
['car2', 'timefeatures2']],
dtype='|S12')
easy_format[1].shape = (2,2)
Now that you can do that, you can format them however you need for the network you're using (transposing rows and columns if necessary, presenting a single sample at a time or all of them at once, etc.)
What you're looking to do (if I'm reading that second paper correctly) most likely requires a sequence to sequence lstm or rnn. Your original sequence is your time series for a given trial, and you're generating an intermediate set of weights (an embedding) that can recreate that original sequence with a low amount of error. You're doing this for all the trials. You will train this lstm on a series of reasonably normal trials and get it to perform well (reconstruct the sequence accurately). You can then use this same set of embeddings to try to reconstruct a new sequence, and if it has a high reconstruction error, you can assume it's anomalous.
Check this repo for a sample of what you'd want along with explanations of how to use it and what the code is doing (it only maps a sequence of integers to another sequence of integers, but can easily be extended to map a sequence of vectors to a sequence of vectors): https://github.com/ichuang/tflearn_seq2seq The pattern you'd define is just your original sequence. You might also take a look at autoencoders for this problem.
Final Edit: Check this repository: https://github.com/beld/Tensorflow-seq2seq-autoencoder/blob/master/simple_seq2seq_autoencoder.py
I have modified the code in it very slightly to work on the newest version of tensorflow and to make some of the variable names clearer. You should be able to modify it to run on your dataset. Right now I'm just having it autoencode a randomly generated array of 1's and 0's. You would do this for a large subset of your data and then see if other data was reconstructed accurately or not (much higher error than average might imply an anomaly).
import numpy as np
import tensorflow as tf
learning_rate = 0.001
training_epochs = 30000
display_step = 100
hidden_state_size = 100
samples = 10
time_steps = 20
step_dims = 5
test_data = np.random.choice([ 0, 1], size=(time_steps, samples, step_dims))
initializer = tf.random_uniform_initializer(-1, 1)
seq_input = tf.placeholder(tf.float32, [time_steps, samples, step_dims])
encoder_inputs = [tf.reshape(seq_input, [-1, step_dims])]
# the decoder is fed a "GO" tensor of zeros followed by all but the last encoder input
decoder_inputs = ([tf.zeros_like(encoder_inputs[0], name="GO")] + encoder_inputs[:-1])
# the autoencoder's reconstruction targets are its own inputs
targets = encoder_inputs
weights = [tf.ones_like(targets_t, dtype=tf.float32) for targets_t in targets]
# encoder LSTM summarizes the input sequence into its final state
cell = tf.contrib.rnn.BasicLSTMCell(hidden_state_size)
_, enc_state = tf.contrib.rnn.static_rnn(cell, encoder_inputs, dtype=tf.float32)
# project decoder outputs back to the input dimensionality and decode from the encoder state
cell = tf.contrib.rnn.OutputProjectionWrapper(cell, step_dims)
dec_outputs, dec_state = tf.contrib.legacy_seq2seq.rnn_decoder(decoder_inputs, enc_state, cell)
y_true = [tf.reshape(encoder_input, [-1]) for encoder_input in encoder_inputs]
y_pred = [tf.reshape(dec_output, [-1]) for dec_output in dec_outputs]
loss = 0
for i in range(len(y_true)):
    loss += tf.reduce_sum(tf.square(tf.subtract(y_pred[i], y_true[i])))
optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss)
init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    x = test_data
    for epoch in range(training_epochs):
        #x = np.arange(time_steps * samples * step_dims)
        #x = x.reshape((time_steps, samples, step_dims))
        feed = {seq_input: x}
        _, cost_value = sess.run([optimizer, loss], feed_dict=feed)
        if epoch % display_step == 0:
            print("logits")
            a = sess.run(y_pred, feed_dict=feed)
            print(a)
            print("labels")
            b = sess.run(y_true, feed_dict=feed)
            print(b)
            print("Epoch:", '%04d' % (epoch + 1), "cost=", "{:.9f}".format(cost_value))
    print("Optimization Finished!")
Your input shape and the corresponding model depend on what type of anomaly you want to detect. You can consider:
1. Feature-only anomaly:
Here you consider individual feature vectors and decide whether any of them is anomalous, without considering when they were measured. In your example, a feature vector [torque, speed, acceleration, ...] is an anomaly if one or more of its values is an outlier with respect to the other features. In this case your inputs should be of the form [batch, features].
2. Time-feature anomaly:
Here your inputs depend on when you measure the features. Your current feature may depend on the previous features measured over time. For example, a feature value may be an outlier if it appears at time 0 but not if it appears further on in time. In this case you divide each of your trials into overlapping time windows and form a feature set of the form [batch, time_window, features] (see the windowing sketch after this list).
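A small sketch of this windowing step, assuming each trial is a (n_time_steps, n_features) NumPy array; the window and stride values are placeholders:
import numpy as np

def make_windows(trial, window, stride):
    """trial: (n_time_steps, n_features) -> (n_windows, window, n_features) with overlapping windows."""
    starts = range(0, trial.shape[0] - window + 1, stride)
    return np.stack([trial[s:s + window] for s in starts])

trial = np.random.rand(100, 5)                      # stand-in for one car's (time_steps, features) data
batch = make_windows(trial, window=20, stride=5)    # (17, 20, 5): [batch, time_window, features]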
It should be very simple to start with (1) using an autoencoder: you train the autoencoder and, based on the error between input and output, choose a threshold such as 2 standard deviations from the mean to determine whether a sample is an outlier or not.
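A minimal sketch of that threshold rule; recon_errors stands in for the per-example reconstruction errors of a trained autoencoder on normal data (the numbers here are synthetic placeholders):
import numpy as np

rng = np.random.default_rng(0)
recon_errors = rng.normal(loc=0.05, scale=0.01, size=1000)   # hypothetical errors on normal training data
threshold = recon_errors.mean() + 2 * recon_errors.std()     # 2-standard-deviation rule

new_errors = rng.normal(loc=0.05, scale=0.01, size=50)       # errors on new, unseen examples
is_anomaly = new_errors > threshold                          # True where reconstruction is unusually poor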
For (2), you can follow the second paper you mentioned and use a seq2seq model, where the decoder error will determine which features are outliers. You can check this for an implementation of such a model.