How to train a CNN on an unlabeled dataset? - python

I want to train a CNN on my unlabeled data, and from what I read on Keras/Kaggle/TF documentation or Reddit threads, it looks like I will have to label my dataset beforehand. Is there a way to train the CNN in an unsupervised way?
I cannot understand how to initialize y_train and y_test (where y_train and y_test represent usual meanings)
The information about my dataset is as follows:
I have 50,000 matrices of dimension 30 x 30.
Each matrix is divided into 9 subareas (for understanding, as separated by the vertical and horizontal bars).
A subarea is said to be active if it has at least one element equal to 1. If all elements for that subarea are equal to 0, the subarea is inactive.
For the first example shown below, I should get as output the names of subareas that are active, so here, (1, 4, 5, 6, 7, 9).
If no subarea is active, as in the second example, the output should be 0.
First example: Output - (1, 4, 5, 6, 7, 9)
Second example: Output - 0
After creating these matrices, I did the following:
I put these matrices in a CSV file after reshaping them into vectors of dimension 900 x 1.
So basically, each row in the CSV contains 900 columns with values either 0 or 1.
The classes for my classification problem are the numbers 0-9, where 0 represents the case in which no subarea is active (no element equal to 1).
For my model, I want the following:
Input: a 900 x 1 vector as described above.
Output: one of the values from 0-9, where 1-9 represent the active subareas, and 0 represents no active subarea.
What I have done:
I am able to retrieve the data from the CSV file into a data frame and split the data frame into x_train and x_test. But I am unable to understand how to set my y_train and y_test values.
My problem seems very similar to the MNIST dataset, except I don't have the labels. Would it be possible for me to train the model without the labels?
My code currently looks like this:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Read the dataset from the CSV file into a dataframe
df = pd.read_csv("bci_dataset.csv")
# Split the dataframe into training and test dataset
train, test = train_test_split(df, test_size=0.2)
x_train = train.iloc[:, :]
x_test = test.iloc[:, :]
print(x_train.shape)
print(x_test.shape)
Thank you, in advance, for reading this whole thing and helping me out!

Can you tell us why you want to use a CNN specifically? Generally neural networks are used when there's some complication involved in going from feature to output - the artificial neurons are able to learn different behavior as a result of being exposed to the ground truth (i.e., the labels). Most of the time, the researcher using the neural network doesn't even know what features of the input data are being used by the network to come to its output conclusions.
In the case you have given us, it looks a little bit more like you know what features are important (that is, the sum of a subarea has to be greater than 0 in order to be active). The neural network wouldn't need to really learn anything in particular to do its job. Although it doesn't seem necessary to use a neural network for this process, it does make sense for you to automate it, given the size of your input data! :)
Let me know if I'm misunderstanding your situation, though?
Edit: To contrast this with the MNIST dataset - so for identifying handwritten digits, there's some ambiguity that the network has to learn to deal with. Not every kind of handwriting is going to render a 7 the same way. A neural network is able to figure out a couple of the features of a 7 (i.e., there is a high probability that a 7 will have a diagonal line going from top-right-to-bottom-left, which, depending on how you write, could be slightly curved or offset or whatever), as well as a couple of different versions of a 7 (some people do a horizontal slash through the middle of it, other versions of a 7 don't have that slash). The utility of a neural network here is in figuring out all that ambiguity and probabilistically classifying an input as a 7 (because it has seen previous images that it "knows" are 7s). However, in your case, there's only one way for your answer to be rendered - if there's any element greater than 0 in a subarea, it's active! So you don't need to train a network to do anything - you will just need to write some code that automates the summing of the subareas.
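As a minimal sketch of that automation, assuming the 9 subareas are equal 10 x 10 blocks numbered 1 to 9 row by row (the question does not state the exact layout, so both are assumptions), and with active_subareas being a hypothetical helper name:

```python
import numpy as np

def active_subareas(vec, size=30, grid=3):
    """Return the active subarea numbers of one flattened matrix, or 0 if none."""
    block = size // grid  # assumed block edge length: 10
    m = np.asarray(vec).reshape(size, size)
    # View the matrix as a grid x grid arrangement of blocks,
    # then check each block for at least one nonzero element.
    active = m.reshape(grid, block, grid, block).any(axis=(1, 3))
    ids = np.flatnonzero(active) + 1  # row-major numbering 1..9 (assumption)
    return tuple(ids.tolist()) if ids.size else 0

# Example: one element set in the top-left block and one in the bottom-right block.
example = np.zeros((30, 30), dtype=int)
example[0, 0] = 1
example[29, 29] = 1
print(active_subareas(example.reshape(-1)))  # (1, 9)
```

Applied to each 900-element row of the CSV, a function like this yields the outputs directly, so no trained model is needed.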

Related

XGBoost XGBRegressor predict with different dimensions than fit

I am using the xgboost XGBRegressor to train on data with 20 dimensions:
model = xgb.XGBRegressor(objective='reg:squarederror', n_estimators=20)
model.fit(trainX, trainy, verbose=False)
trainX is 2000 x 19, and trainy is 2000 x 1.
In other words, I am using the 19 dimensions of trainX to predict the 20th dimension (the single dimension of trainy) during training.
When I am making a prediction:
yhat = model.predict(x_input)
x_input has to be 19 dimensions.
I am wondering whether there is a way to keep training on the 19 dimensions to predict the 20th, but at prediction time supply an x_input that has only 4 of those dimensions. It is a kind of transfer learning to a different input dimension.
Does xgboost support such a feature? I tried filling x_input's other dimensions with None, but that yields terrible prediction results.
Fundamentally, you're training your model with a dense dataset (19/19 feature values), and are now wondering if you're allowed to make predictions with a sparse dataset (4/19 feature values).
Does xgboost support such a feature?
Yes, it is technically possible with XGBoost, because XGBoost will treat the absent 15/19 feature values as missing. It will not be possible with ML frameworks whose estimators do not accept missing values by default (as is the case for many Scikit-Learn estimators).
Alternatively, you can make your XGBoost model explicitly "missing-value-proof" by assembling a pipeline which contains feature imputation step(s).
I tried filling x_input's other dimensions with None, but that yields terrible prediction results.
You should represent missing values as float("NaN") (not as None).
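A minimal sketch of how such an input could be assembled, using NumPy only. The helper name pad_with_nan and the column positions are hypothetical; which 4 of the 19 columns are actually known depends on your data.

```python
import numpy as np

def pad_with_nan(x_known, known_cols, n_features=19):
    """Place the known feature columns into a full-width array; the rest stay NaN."""
    x_known = np.asarray(x_known, dtype=np.float32)
    x_full = np.full((x_known.shape[0], n_features), np.nan, dtype=np.float32)
    x_full[:, known_cols] = x_known
    return x_full

# Hypothetical example: 2 rows for which only columns 0, 3, 7 and 12 are known.
x_partial = np.array([[1.0, 2.0, 3.0, 4.0],
                      [5.0, 6.0, 7.0, 8.0]])
x_input = pad_with_nan(x_partial, known_cols=[0, 3, 7, 12])
print(x_input.shape)  # (2, 19)
```

The padded array could then be passed to model.predict(x_input). Predictions may still be poor, because the model was trained only on rows where all 19 features were present.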
If I understand your question correctly, you are trying to train a model with 19 features, but then feed it only a subset of those features (4, in your case) to make a prediction.
That's not going to be possible. When you train a model, you are assuming that your data points are drawn from a probability distribution P(X,Y), where Y is your label and X is your features. If you change the dimensionality of X, it no longer belongs to that distribution (at least intuitively; I am not a mathematician, so I cannot come up with a proof for this).
For instance, let's assume your data lies on a 3D cube. That means that you need three coordinate axes to represent a point on it. You cannot place a point using 2 dimensions without assuming the value of the remaining dimension.
You can assume the values of the features you try to drop, but they may not represent the data you originally trained on.

How to get the output with maximum probability from all the predicted outputs of a dense layer?

I trained a neural network for sign language recognition. Here's my output layer: model.add(Dense(units=26,activation="softmax"))
Now I'm getting probabilities for all 26 letters of the alphabet. Somehow I'm getting 99% accuracy when I test this model with accuracy = model.evaluate(x=test_X,y=test_Y,batch_size=32). I'm new at this. I can't understand how this code works and I'm missing something major here. How do I get a 1D list containing just the predicted letter for each sample?
To get probabilities you need to do something like this:
prediction = model.predict(test_X)
probs = prediction.max(1)
But it is important to remember that softmax outputs, although they are non-negative and sum to 1, are not necessarily well-calibrated probabilities for each class.
To get outputs with maximum probability in a single list, run:
np.argmax(model.predict(x_test),axis=1)
Supposing alphabet is a list with all alphabet symbols alphabet = ['a', 'b', ...]
pred = model.predict(test_X)
pred_ind = pred.argmax(1)  # index of the largest output in each row (max would return the value, not the index)
pred_alphabet = [alphabet[ind] for ind in pred_ind]
will give you the list with predicted symbols.
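The same idea as a self-contained sketch. The array fake_probs is a made-up stand-in for the output of model.predict(test_X), and the a-z ordering of the 26 classes is an assumption; use whatever ordering your labels were encoded with.

```python
import string
import numpy as np

alphabet = list(string.ascii_lowercase)  # 26 class names; the ordering is an assumption

def decode_predictions(probs, labels):
    """Map each row of class scores to the label with the highest score."""
    idx = np.asarray(probs).argmax(axis=1)
    return [labels[i] for i in idx]

# Stand-in for model.predict(test_X): 3 samples x 26 classes.
fake_probs = np.full((3, 26), 0.01)
fake_probs[0, 0] = 0.75   # 'a'
fake_probs[1, 2] = 0.75   # 'c'
fake_probs[2, 25] = 0.75  # 'z'
print(decode_predictions(fake_probs, alphabet))  # ['a', 'c', 'z']
```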
In a neural network, the first layer is for the input image. Let's say your image is 32x32 pixels. In that case you would have 32x32x3 nodes in the input layer; the 3 comes from the RGB color channels. Then, depending on your design and model, you should use an appropriate number of hidden layers. A small number of hidden layers, such as 2, is a common starting point. The final layer has one node per distinct class. Let's say you're going to identify 26 distinct signs; then you will have 26 nodes in the final layer.
model.evaluate(x=test_X,y=test_Y,batch_size=32)
Here you are evaluating the model on your test data set rather than retrieving predictions: model.evaluate runs the model on test_X, compares the outputs with test_Y, and returns the loss together with any metrics the model was compiled with. Earlier you presumably separated your data set into train and test sets; test_X stands for the images in the test set and test_Y for the corresponding labels. The network is evaluated on 32 images at a time, which is the meaning of batch_size=32.
I think this information might be helpful for understanding what you're doing, although your question is not entirely clear. Please refer to the tutorial below; it might be helpful.
https://www.pyimagesearch.com/2018/09/10/keras-tutorial-how-to-get-started-with-keras-deep-learning-and-python/

SKlearn prediction on test dataset with different shape from training dataset shape

I'm new to ML and would be grateful for any assistance provided. I've run a linear regression prediction using test set A and training set A. I saved the linear regression model and would now like to use the same model to predict a test set A target using features from test set B. Each time I run the model it throws up the error below
How can I successfully predict a test data set from features and a target with different shapes?
Input
print(testB.shape)
print(testA.shape)
Output
(2480, 5)
(1315, 6)
Input
saved_model = joblib.load(filename)
testB_result = saved_model.score(testB_features, testA_target)
print(testB_result)
Output
ValueError: Found input variables with inconsistent numbers of samples: [1315, 2480]
Thanks again
The shapes are inconsistent, which is why the error is being thrown. Have you tried reshaping the data so that they match? From a quick look, testB has more samples and one fewer column than testA.
Think about it: if you have trained your model with 5 features, you cannot then ask the same model to make a prediction given 6 features. You mention using a linear regressor, whose equation is roughly:
y = b + w0*x0 + w1*x1 + w2*x2 + .. + wN-1*xN-1
Where {
y is your output/label
N is the number of features
b is the bias term
w(i) is the ith weight
x(i) is the ith feature value
}
You have trained a linear regressor with 5 features, effectively producing the following
y (your output/label) = b + w0*x0 + w1*x1 + w2*x2 + w3*x3 + w4*x4
You then ask it to make a prediction given 6 features but it only knows how to deal with 5.
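To make the equation concrete, here is a small NumPy sketch of a 5-feature linear model. The weights and bias are made-up values and linear_predict is a hypothetical helper, not scikit-learn's implementation.

```python
import numpy as np

def linear_predict(X, w, b):
    """Compute y = b + w0*x0 + ... + w(N-1)*x(N-1) for every row of X."""
    X = np.asarray(X, dtype=float)
    if X.shape[1] != w.shape[0]:
        raise ValueError(f"expected {w.shape[0]} features, got {X.shape[1]}")
    return b + X @ w

w = np.array([0.5, -1.0, 2.0, 0.0, 1.5])  # made-up weights for 5 features
b = 0.1                                   # made-up bias term

X5 = np.ones((3, 5))  # 3 samples with 5 features: works
X6 = np.ones((3, 6))  # 3 samples with 6 features: rejected
print(linear_predict(X5, w, b))
try:
    linear_predict(X6, w, b)
except ValueError as err:
    print(err)  # expected 5 features, got 6
```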
Aside from that issue, the sample counts also differ: testB has 2480 samples and testA has 1315. These need to match, because the model will make 2480 predictions but you only give it 1315 targets to compare them with. How can you get a score for the 1165 missing samples? Do you now see why the data has to be reshaped?
EDIT
Assuming you have datasets with an equal amount of features as discussed above, you may now look at reshaping (removing data) testB like so:
testB = testB[0:1315, :]  # the slice end is exclusive, so this keeps rows 0..1314 (1315 rows)
testB.shape
(1315, 5)
Or, if you would prefer a solution using the numpy API:
testB = np.delete(testB, np.s_[0:(len(testB)-len(testA))], axis=0)
testB.shape
(1315, 5)
Keep in mind that when doing this you slice out a number of samples. If those samples are important to you (which they can be), it may be better to introduce a pre-processing step that deals with the missing values, namely imputing them. It is also worth noting that the data you are truncating should be shuffled first (unless it already is), as you may otherwise be removing a systematic part of the data. Neglecting to do this could give you results that do not generalise as well as you hoped.

LSTM validation

I have a dataset with 100k rows, each of which is a store-item pair (e.g. (store 1, item 190)), and 300 columns, which are consecutive dates (e.g. 2017-01-01, 2017-01-02, 2017-01-03 ...). The values are the sales.
I am trying to use a Keras LSTM to predict future sales. How can I construct my train and validation datasets?
I am thinking of splitting train and validation like data[:, :n_days] and data[:, n_days:], so that I have the same number of samples (100k) in both my train and validation datasets. Am I thinking about this the wrong way?
If this is the way, how should I define n_days? Should the train and validation datasets have exactly the same dimensions (something like n_days = 150, with 149 days used to predict 1 day)?
how can I construct my train and validation dataset?
I'm not sure it counts as a rule of thumb, but a common approach is to split your dataset into a ~80% training set and a ~20% validation set; in your case this would be approximately 80k and 20k rows. The actual percentages may vary, but a ratio in that range is widely recommended. Ideally you would also have a test dataset, one that you never use during training or validation, to evaluate the final performance of your models.
Now, regarding the shape of your data, it is important to recall what the Keras docs on recurrent layers say:
Input shape
3D tensor with shape (batch_size, timesteps, input_dim).
Defining this shape depends on the nature of your problem. You mention you want to predict sales, so this can be phrased as a regression problem. You also mention that your data consists of 300 columns that make up your time series, and naturally you have the real sales value for each of those rows.
In this case, using a batch size of 1, your input shape would be (1, 300, 1). This means you are training on batches of 1 element (one gradient update per sample), where each element has 300 time steps and 1 feature per time step.
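A small NumPy sketch of how a 2D sales table could be turned into that 3D layout. The array below is a tiny made-up stand-in for the 100k x 300 table, and n_days = 149 simply follows the "149 days to predict 1 day" idea from the question; it is an assumption, not a recommendation.

```python
import numpy as np

# Tiny stand-in for the real table: 8 store-item series x 300 days.
n_series, n_total_days = 8, 300
data = np.arange(n_series * n_total_days, dtype=np.float32).reshape(n_series, n_total_days)

n_days = 149               # hypothetical input window length
X = data[:, :n_days]       # the first 149 days are the model input
y = data[:, n_days]        # day 150 is the regression target

# Keras recurrent layers expect (batch_size, timesteps, input_dim),
# so add a trailing axis for the single feature (sales).
X = X[..., np.newaxis]
print(X.shape, y.shape)  # (8, 149, 1) (8,)
```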
For splitting your data one option you can use that has helped me before is the train_test_split method from Sklearn, where you simply pass your data and labels and indicate the ratio you want:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in newer scikit-learn releases
#Split your data to have 15% validation split
X, X_val, Y, Y_val = train_test_split(data, labels, test_size=0.15)

Subsampling + classifying using scikit-learn

I am using Scikit-learn for a binary classification task, and I have:
Class 0: with 200 observations
Class 1: with 50 observations
Because my data is unbalanced, I want to take a random subsample of the majority class with the same number of observations as the minority class, and use the resulting dataset as input to the classifier. The process of subsampling and classifying can be repeated many times. I have the following code for the subsampling, written mainly with the help of Ami Tavory:
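import numpy as np
from sklearn.datasets import load_files
docs_train = load_files(rootdir, categories=categories, encoding='latin-1')
X_train = docs_train.data
y_train = docs_train.target
# load_files returns the documents as a list of strings; wrap them in arrays so boolean masks work
x = np.array(X_train, dtype=object)
y = np.array(y_train)
majority_x, majority_y = x[y == 0], y[y == 0]  # assuming that class 0 is the majority class
minority_x, minority_y = x[y == 1], y[y == 1]
# draw 50 random majority indices (with replacement, which is np.random.choice's default)
inds = np.random.choice(range(majority_x.shape[0]), 50)
majority_x = majority_x[inds]
majority_y = majority_y[inds]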
It works like a charm. However, after processing majority_x and majority_y, I want to replace the old set that represents class 0 in X_train, y_train with the new, smaller set, in order to pass it to the classifier or the pipeline as follows:
pipeline = Pipeline([
('vectorizer', CountVectorizer( tokenizer=tokens, binary=True)),
('classifier',SVC(C=1,kernel='linear')) ])
pipeline.fit(X_train, y_train)
What I have done in order to solve this:
Since the resulting arrays were numpy arrays, and because I am new to this whole area and trying hard to learn, I tried to combine the two arrays with majority_x+minority_x to form the training data that I want. That did not work; it gave some errors which I am still trying to solve. But even if it had worked, how can I keep the indices aligned so that majority_y and minority_y still match their samples?
After processing majority_x and majority_y, you can merge your training sets with
X_train = np.concatenate((majority_x,minority_x))
y_train = np.concatenate((majority_y,minority_y))
Now X_train and y_train will first contain the chosen samples with y=0 and then the samples with y=1.
An idea for your related question:
Make your choice of the majority samples by creating a random permutation vector of the length of the number of your majority samples.
Then choose the first 50 indices of that vector, then the next 50 and so on.
When you are through with that vector, each sample will have been chosen exactly once.
If you want more iterations or the remaining permutation vector is too short, you can resort back to random choice.
As I mentioned in my comment, you might also want to add the parameter "replace=False" in your np.random.choice,
if you want to prevent having the same sample multiple times in one iteration.
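The permutation idea can be sketched as follows. The function name balanced_subsamples and the toy arrays are hypothetical; the sketch assumes x and y are NumPy arrays of equal length and that the labels are 0 (majority) and 1 (minority).

```python
import numpy as np

def balanced_subsamples(x, y, majority=0, minority=1, seed=0):
    """Yield balanced (x, y) subsets; each majority sample is used at most once."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority)
    min_idx = np.flatnonzero(y == minority)
    perm = rng.permutation(maj_idx)  # shuffled copy of the majority indices
    k = len(min_idx)
    # Walk through the permutation in consecutive chunks of k indices.
    for start in range(0, len(perm) - k + 1, k):
        chosen = perm[start:start + k]
        idx = np.concatenate((chosen, min_idx))
        yield x[idx], y[idx]

# Toy data matching the question: 200 majority and 50 minority observations.
x = np.arange(250)
y = np.array([0] * 200 + [1] * 50)
subsets = list(balanced_subsamples(x, y))
print(len(subsets))  # 4
```

Because each chunk is taken from a permutation, no sample is repeated within a subset, which is the same effect replace=False has for a single draw.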
