How to train on unlabeled data in machine learning? - python

I have data for almost 9,000 entities, and I want to train a model on it and detect anomalies in the data.
I tried a few things to get this working. One thing I did is:
def create_sequences(values, time_steps=TIME_STEPS):
    # Slide a window of length time_steps over the series and stack the windows.
    output = []
    for i in range(len(values) - time_steps):
        output.append(values[i : (i + time_steps)])
    return np.stack(output)
Here is where I start building my training data:
# Note: each assignment below overwrites x_train, so only the last feature
# (RESP) actually ends up in the final training array.
x_train = create_sequences(data['HR'].values)
x_train = np.expand_dims(x_train, axis=2)
x_train = create_sequences(data['PULSE'].values)
x_train = np.expand_dims(x_train, axis=2)
x_train = create_sequences(data['SpO2'].values)
x_train = np.expand_dims(x_train, axis=2)
x_train = create_sequences(data['ABPDias'].values)
x_train = np.expand_dims(x_train, axis=2)
x_train = create_sequences(data['ABPMean'].values)
x_train = np.expand_dims(x_train, axis=2)
x_train = create_sequences(data['RESP'].values)
x_train = np.expand_dims(x_train, axis=2)
And here is my model for training:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv1D, Conv1DTranspose, Dropout, LSTM, MaxPooling1D

model = Sequential()
model.add(Conv1D(filters=32, kernel_size=7, padding="same", strides=2,
                 input_shape=(x_train.shape[1], x_train.shape[2])))
model.add(MaxPooling1D(pool_size=1, padding="valid"))
model.add(Dropout(0.2))
model.add(Conv1D(filters=16, kernel_size=7, padding="same", strides=2))
model.add(LSTM(units=20, return_sequences=True))
model.add(Dropout(0.2))
model.add(Conv1DTranspose(filters=16, kernel_size=7, padding="same", strides=2))
model.add(Conv1D(filters=32, kernel_size=7, padding="same"))
model.add(MaxPooling1D(pool_size=2, padding="valid"))
model.add(Conv1DTranspose(filters=32, kernel_size=7, padding="same", strides=4, activation="relu"))
model.add(Conv1DTranspose(filters=1, kernel_size=7, padding="same"))
model.compile(optimizer="adam", loss="mse")
model.summary()
history = model.fit(
    x_train,
    x_train,
    epochs=150,
    batch_size=128,
    validation_split=0.1,
)
But this takes a lot of time. What am I missing? Can anyone guide me?
And one more thing: should I use train_test_split for unlabeled data?

You cannot do supervised learning without labeled data, and it is not preferable to use the same features as both input and label. What you are looking for is clustering-based anomaly detection, which falls under unsupervised learning. DBSCAN might be a good choice for this task; it is available in scikit-learn.
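As a minimal sketch of that suggestion (the column names are taken from the question; eps and min_samples are placeholder values you would need to tune):
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

features = ['HR', 'PULSE', 'SpO2', 'ABPDias', 'ABPMean', 'RESP']
X = StandardScaler().fit_transform(data[features].values)  # scale features before density-based clustering
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)     # DBSCAN labels noise points as -1
anomalies = np.where(labels == -1)[0]                      # indices of points DBSCAN considers outliers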

You use some layers to encode and then decode the data, but the technique you applied is supervised machine learning (ML). Since your dataset is unlabeled, you need to employ unsupervised ML approaches.
Clustering is a technique for finding patterns in unlabelled data with many dimensions.
There are two different approaches to clustering-based anomaly detection:
1- Unsupervised clustering, where the anomaly detection model is trained on unlabelled data that contains both normal and attack traffic.
2- Semi-supervised clustering, where the model is trained on normal data only to build a profile of normal activity (a sketch of this idea follows below).
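For the second approach, a minimal sketch (my own illustration, not from the answer) using the autoencoder from the question: train it on normal windows only, then score each window by its reconstruction error and flag the largest errors. This assumes the autoencoder's output shape matches its input.
x_pred = model.predict(x_train)
mse = np.mean((x_pred - x_train) ** 2, axis=(1, 2))  # per-window reconstruction error
threshold = np.percentile(mse, 99)                   # placeholder: tune on held-out normal data
anomalous_windows = np.where(mse > threshold)[0]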


How to pick data for LSTM model in Keras

I'm trying to create a model to predict the stock price from a sequence of data for some market pair. I'm just starting out with LSTM models and I'm not sure about the importance of the order of the data.
Consider this frame of my data:
df[55:65]
These rows contain the features of each candle, ordered by time, plus some technical trading indicators I added for testing, and finally the value to predict.
I tried to train my model:
model = Sequential()
model.add(LSTM(
    units=50,
    return_sequences=True,
    input_shape=(shape_examples, shape_features)
))
model.add(Dropout(0.2))
model.add(LSTM(units=50, return_sequences=True))
model.add(Dropout(0.25))
model.add(LSTM(units=50))
model.add(Dropout(0.25))
model.add(Dense(units=1))
model.compile(optimizer='adam', loss='mean_squared_error')
I am creating sequences of length 60 to feed the model:
def get_dataset(train_data):
    # Each sample is the previous 60 rows of every feature column (all but the last).
    data = []
    for i in range(60, len(train_data)):
        data.append(train_data[i-60:i, :-1])
    return np.array(data)
And I am splitting the data in two ways:
# first way: chronological split
training_data_len = math.ceil(len(Y) * .75)
x_train = X[:training_data_len]
y_train = Y[:training_data_len]
x_test = X[training_data_len:-1]
y_test = Y[training_data_len:-1]
# second way: random split
X_train_, X_test_, Y_train_, Y_test_ = train_test_split(
    X, Y, test_size=0.33, random_state=42)
I fitted my model and got these results (plots for the first and second splitting, respectively, not reproduced here).
The questions are:
Is the order of the data important in this example?
Why did I get those different results?
Is the second approach underfitting? Why was the test data easier to predict?

Keras - logits and labels must have the same shape ((None, 1) vs (None, 363))

I have been using the "CMU Movie Summary Corpus" for movie genre classification for my own learning, since I am relatively new to NLP. This is a multi-label classification task. I am using TF-IDF for word vectorization and then feeding the output into a neural network, but I get the following error when I try to train my model:
ValueError: logits and labels must have the same shape ((None, 1) vs (None, 363))
I have read about this on StackOverflow and tried to implement the suggestions, but the issue persists. However, I am able to train classic machine learning models using the same TF-IDF features; it is only the neural network I am unable to train.
Below is the code snippet:
# create TF-IDF features
xtrain_tfidf = tfidf_vectorizer.fit_transform(xtrain)
xval_tfidf = tfidf_vectorizer.transform(xval)
xtrain_tfidf[0]
# output: <1x10000 sparse matrix of type '<class 'numpy.float64'>' with 63 stored elements in Compressed Sparse Row format>
# Building the neural network
xtrain_tfidf = xtrain_tfidf.toarray()
xval_tfidf = xval_tfidf.toarray()
# Preprocess the data by setting all arrays to the same size using padding
train_data = keras.preprocessing.sequence.pad_sequences(xtrain_tfidf,
                                                        value=0,
                                                        padding='post',
                                                        maxlen=2000)
test_data = keras.preprocessing.sequence.pad_sequences(xval_tfidf,
                                                       value=0,
                                                       padding='post',
                                                       maxlen=2000)
# input shape is the vocabulary count used for the movie reviews (10,000 words)
vocab_size = 10000
model = keras.models.Sequential([
    keras.layers.Dense(20, kernel_regularizer=keras.regularizers.l2(0.001),
                       activation=tf.nn.relu, input_shape=(2000,)),
    keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),
                       activation=tf.nn.relu),
    keras.layers.Dense(1, activation=tf.nn.sigmoid)
])
# Model features
model.compile(optimizer=tf.optimizers.Adam(),
              loss='binary_crossentropy',
              metrics=['accuracy'])
# Create a cross-validation set
x_val = train_data[:1000]
partial_x_train = train_data[1000:]
y_val = y_train[:1000]
partial_y_train = y_train[1000:]
# Train the model for 40 epochs
history = model.fit(partial_x_train,
                    partial_y_train,
                    epochs=40,
                    batch_size=512,
                    validation_data=(x_val, y_val),
                    verbose=2)
If this is a multi-class classification task, then this
keras.layers.Dense(1, activation=tf.nn.sigmoid)
should have as many outputs as there are classes (363?), and you should use softmax, not sigmoid (unless a single sample can belong to multiple classes, i.e. multi-label):
keras.layers.Dense(363, activation=tf.nn.softmax)
Also, your loss should then be categorical_crossentropy rather than binary_crossentropy, for the same reason. See the sketch below.
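As a minimal sketch of the suggested fix (my own illustration, assuming 363 mutually exclusive classes and one-hot encoded labels):
model = keras.models.Sequential([
    keras.layers.Dense(20, kernel_regularizer=keras.regularizers.l2(0.001),
                       activation=tf.nn.relu, input_shape=(2000,)),
    keras.layers.Dense(16, kernel_regularizer=keras.regularizers.l2(0.001),
                       activation=tf.nn.relu),
    keras.layers.Dense(363, activation=tf.nn.softmax)  # one output per class
])
model.compile(optimizer=tf.optimizers.Adam(),
              loss='categorical_crossentropy',  # matches softmax + one-hot labels
              metrics=['accuracy'])
If the task really is multi-label (a movie can have several genres at once), keep sigmoid with 363 units and binary_crossentropy instead.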

How do I select train data for LSTM network training

I'm fairly new to RNNs, but I'm trying to predict signals based on recordings.
I have two sets of data, A and B: A is the raw data recording, and B is the binary label data marking '1' for every active event in A; both have shape (1895700, 1).
Could you help me figure out what should be used as the x and y training data?
I've been reading about this and understood that I should loop through A and extract x and y from it. I did this and got input shapes of x_train (189555, 150, 1) and y_train (189555, 150, 1), but I'm getting an accuracy of 0.0000e+00 and a negative loss.
My other approach was using A as x_train and B as y_train, with input shapes of (12638, 150, 1), but from the first step of epoch 1 I had an accuracy of about 96 and a loss around 0.10, and they didn't vary much throughout training.
So I'm not really sure which data should be my input.
model:
model = Sequential()
model.add(LSTM(128, dropout=0.5, input_shape=(ts,features), recurrent_dropout=0.4, return_sequences=True))
model.add(LSTM(128, dropout=0.5, input_shape=(ts,features), recurrent_dropout=0.3, return_sequences=True))
model.add(LSTM(64, dropout=0.5, input_shape=(ts,features), recurrent_dropout=0.3, return_sequences=True))
model.add(Dense(features, input_shape=(ts, features), activation="sigmoid"))
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
Thanks in advance!
Your X_train is the data that represents your features, while Y_train is the data that represents the output for the X_train features.
You can split your data by simply providing the validation_split parameter to the fit function:
model.fit(X_data, Y_data, batch_size=4, epochs=5, verbose=1, validation_split=0.2)
In this case it will hold out 20% of the data for validation.
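For the specific arrays in the question, a minimal sketch (my own illustration) of turning A and B into aligned inputs and targets by cutting them into non-overlapping windows of 150 timesteps:
import numpy as np

ts = 150
n = (len(A) // ts) * ts             # 1895700 is divisible by 150, giving 12638 windows
x_train = A[:n].reshape(-1, ts, 1)  # raw recording -> model input
y_train = B[:n].reshape(-1, ts, 1)  # binary event labels -> per-timestep targets
Here x_train and y_train both get the shape (12638, 150, 1) mentioned in the question, which matches the per-timestep sigmoid output of the model above.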

RNN model predicting only one class?

I am trying to use GloVe embeddings to train an RNN model based on this article.
I have labeled data: text (tweets) in one column and labels in another (hate, offensive, or neither).
However, the model seems to predict only one class.
This is the LSTM model:
model = Sequential()
hidden_layer = 3
gru_node = 32
# model embedding matrix here....
for i in range(0, hidden_layer):
    model.add(GRU(gru_node, return_sequences=True, recurrent_dropout=0.2))
    model.add(Dropout(dropout))
model.add(GRU(gru_node, recurrent_dropout=0.2))
model.add(Dropout(dropout))
model.add(Dense(64, activation='softmax'))
model.add(Dense(nclasses, activation='softmax'))
start = time.time()
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
Fitting the model:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)
X_train_Glove, X_test_Glove, word_index, embeddings_index = loadData_Tokenizer(X_train, X_test)
model_RNN = Build_Model_RNN_Text(word_index, embeddings_index, 20)
model_RNN.fit(X_train_Glove, y_train,
              validation_data=(X_test_Glove, y_test),
              epochs=4,
              batch_size=128,
              verbose=2)
y_preds = model_RNN.predict_classes(X_test_Glove)
print(metrics.classification_report(y_test, y_preds))
Results: classification report and confusion matrix (images not reproduced here).
Am I missing something here?
Update: the class distribution plot and the model summary were attached (images not reproduced here).
What does the distribution of your data look like? The first suggestion is to stratify the train/test split (see the stratify parameter of train_test_split in the scikit-learn documentation); a sketch follows below.
The second question is how much data you have compared with the complexity of the model. Maybe your model is so complex that it simply overfits. You can use model.summary() to see the number of trainable parameters.
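A minimal sketch of the stratified split suggestion (my own illustration, reusing the X and y from the question):
from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions identical in the train and test sets,
# which matters when classes like "hate" are rare
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1, stratify=y)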

How to deal with large layers in Keras

I'm trying to create a deep learning model with Keras, but when I build the model, Keras uses too much memory.
The model has a large number of inputs (24,000), and that seems to be too much for my computer.
Here is my simple code:
import numpy as np
from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.utils import np_utils

N = 20
M = 5
n = 24310
nbclass = 4
X_train = np.random.rand(N, n)
X_test = np.random.rand(M, n)
Y_train = np_utils.to_categorical(np.random.randint(nbclass, size=N), nbclass)
Y_test = np_utils.to_categorical(np.random.randint(nbclass, size=M), nbclass)

model = Sequential()  # this line was missing from the original snippet
model.add(Dense(n, input_dim=n, activation="relu", kernel_initializer="uniform"))
model.add(Dropout(0.25))
model.add(Dense(n//2, activation="relu", kernel_initializer="uniform"))
model.add(Dropout(0.25))
model.add(Dense(nbclass, activation="softmax", kernel_initializer="uniform"))
# Compiling the model
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])
# Fitting the model
model.fit(X_train, Y_train, validation_data=(X_test, Y_test),
          epochs=100, batch_size=5, callbacks=[PlotLosses()])
My issue is that when the fit runs, it consumes all my memory (20 GB), then starts swapping, and I cannot go any further.
Does anybody know how I can modify the model to make it less memory-consuming?
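For scale, a quick back-of-the-envelope calculation (my own addition, not part of the question) of why this model is so heavy: the first Dense layer alone connects n inputs to n units.
n = 24310
first_layer_params = n * n + n        # weights + biases of Dense(n, input_dim=n)
print(first_layer_params)             # about 591 million parameters
print(first_layer_params * 4 / 1e9)   # roughly 2.4 GB for the weights alone in float32
Gradients and optimizer state multiply that several times over, which explains the swapping.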
