IndexError: boolean index vs indexed array mismatch in numpy ndarray - python

I am working on a multiclass problem (5 classes, highly imbalanced dataset). I would like to implement an ensemble of convolutional auto-encoders where each auto-encoder is trained on a single class, and then ensemble them to obtain the final classification results.
I am, however, stuck at the point of training each encoder on its class. I'm getting an error that I believe has to do with my logic for handling the class-label arrays:
IndexError: boolean index did not match indexed array along dimension 1; dimension is 1 but corresponding boolean dimension is 5
I am working with a really huge dataset, but below I provide an MWE for a 3-class problem that reproduces a similar situation:
# ... scikit-learn, keras, numpy ... library imports

class SingleAED:
    def __init__(self, train, test):
        self.x_train = train
        self.x_test = test

    def setSingleModel(self):
        autoencoder = Sequential()
        activ = 'relu'
        autoencoder.add(Conv2D(32, (1, 3), strides=(1, 1), padding='same', activation=activ, input_shape=(1, Threshold, 4)))
        autoencoder.add(BatchNormalization(axis=3))
        autoencoder.add(Conv2D(32, (1, 3), strides=(1, 1), padding='same', activation=activ))
        autoencoder.add(BatchNormalization(axis=3))
        autoencoder.add(MaxPooling2D(pool_size=(1, 2)))
        autoencoder.compile(optimizer='adam', loss='mae', metrics=['mean_squared_error'])

        filepath = "weights.best.hdf5"
        checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=1, save_best_only=True, mode='max')
        callbacks_list = [checkpoint]
        autoencoder.fit(self.x_train, self.x_train, epochs=250, batch_size=256, shuffle=True, callbacks=callbacks_list)

        return autoencoder
# generate dummy data
X = np.random.randn(20, 1, 5, 4)
a, b, c = np.repeat(0, 7), np.repeat(1, 7), np.repeat(2, 6)
y = np.hstack((a, b, c))
LABELS = list(set(np.ndarray.flatten(y)))
Threshold = len(X[0, 0, :, 0])
NoClass = len(LABELS)

# train-test split
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=7)

# ... to categorical
y_train = keras.utils.to_categorical(y_train, num_classes=NoClass)
y_test = keras.utils.to_categorical(y_test, num_classes=NoClass)

# train an auto-encoder per class
ensemble = []
for i in range(len(LABELS)):
    print(LABELS[i])
    sub_train = x_train[y_train == i]
    sub_test = x_test[y_test == i]

    autoencoder = SingleAED(sub_train, sub_test)
    autoencoder = autoencoder.setSingleModel()
    ensemble.append(autoencoder)
Error:
0
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-98-e00f5454d8b5> in <module>()
      2 for i in range(len(LABELS)):
      3     print(LABELS[i])
----> 4     sub_train = x_train[y_train == i]
      5     sub_test = x_test[y_test == i]
      6
IndexError: boolean index did not match indexed array along dimension 1; dimension is 1 but corresponding boolean dimension is 3
In this case, I want to loop through classes 0..2 to train one encoder per class. I am not sure why I get this error; can someone help me sort this out?

You want to index the x_train array with y_train before it is converted to categorical: after one-hot encoding, y_train has shape (n_samples, NoClass), so the boolean mask y_train == i no longer matches x_train along its first axis.
x_train, x_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.20,
                                                    random_state=7)

# EDIT: DO NOT OVERRIDE the integer labels!
y_train_cat = keras.utils.to_categorical(y_train, num_classes=NoClass)
y_test_cat = keras.utils.to_categorical(y_test, num_classes=NoClass)

# train an auto-encoder per class
ensemble = []
for i in range(len(LABELS)):
    print(LABELS[i])
    # EDIT: USE THE NON-CATEGORICAL LABELS
    sub_train = x_train[y_train == i]
    sub_test = x_test[y_test == i]

    autoencoder = SingleAED(sub_train, sub_test)
    autoencoder = autoencoder.setSingleModel()
    ensemble.append(autoencoder)
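To see why the one-hot labels break the boolean indexing, here is a minimal standalone sketch (with made-up data, not the original dataset):

import numpy as np

y = np.array([0, 1, 2, 0])              # integer labels, shape (4,)
y_cat = np.eye(3)[y]                    # one-hot encoded, shape (4, 3)
x = np.random.randn(4, 1, 5, 4)

print(x[y == 0].shape)                  # (2, 1, 5, 4): mask shape (4,) matches x along axis 0
# x[y_cat == 0]                         # IndexError: boolean dimension is 3 but x's dimension 1 is 1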

Related

Keras inverse scaling prediction from model causes problems with broadcasting with shapes

I have built a multi-classification model with Keras, and now that the model is finished I would like to predict the value for one of my test inputs.
This is the part where I scaled the features:
x = dataframe.drop("workTime", axis = 1)
x = dataframe.drop("creation", axis = 1)
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x = pd.DataFrame(sc.fit_transform(x))
y = dataframe["workTime"]
import seaborn as sb
corr = dataframe.corr()
sb.heatmap(corr, cmap="Blues", annot=True)
print("Scaled features:", x.head(3))
Then I did:
y_cat = to_categorical(y)
x_train, x_test, y_train, y_test = train_test_split(x.values, y_cat, test_size=0.2)
And built the model:
model = Sequential()
model.add(Dense(16, input_shape = (9,), activation = "relu"))
model.add(Dense(8, activation = "relu"))
model.add(Dropout(0.5))
model.add(Dense(6, activation = "softmax"))
model.compile(Adam(lr = 0.0001), "categorical_crossentropy", metrics = ["categorical_accuracy"])
model.summary()
model.fit(x_train, y_train, verbose=1, batch_size = 8, epochs=100, shuffle=True)
After training finished, I wanted to take the first element from the test data and predict/classify it.
print(x_test.shape, x_train.shape)  # (1550, 9) (6196, 9)

firstTest = x_test[:1]
# [[ 2.76473141  1.21064165  0.18816548 -0.94077449 -0.30981017 -0.37723917
#   -0.44471711 -1.44141792  0.20222467]]

prediction = model.predict(firstTest)
print(prediction)
# [[7.5265622e-01 2.4710520e-01 2.3643016e-04 2.1405797e-06 3.8411264e-19
#   9.4137732e-23]]
print(prediction[0])
# [7.5265622e-01 2.4710520e-01 2.3643016e-04 2.1405797e-06 3.8411264e-19
#  9.4137732e-23]

unscaled = sc.inverse_transform(prediction)
print("prediction", unscaled)
Doing this, I get:
ValueError: operands could not be broadcast together with shapes (1,6) (9,) (1,6)
I think it may be related to my scalers.
And please correct me if I'm wrong, but what I want to achieve here is either a single output value that tells me how this entry was classified, or an array of probabilities for each class label.
Thank you for any hints.
Your StandardScaler was used to scale the input features; you can't apply it (or its inverse) to the outputs!
If you are looking for the probabilities of the test sample being in each class, you already have it in prediction[0].
If you want the final class predicted, just take the one with the largest probability with argmax: tf.math.argmax(prediction[0]).
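For example, a minimal sketch using the prediction printed above (np.argmax is equivalent to tf.math.argmax here):

import numpy as np

prediction = np.array([[7.5265622e-01, 2.4710520e-01, 2.3643016e-04,
                        2.1405797e-06, 3.8411264e-19, 9.4137732e-23]])
class_probs = prediction[0]        # probability per class, shape (6,)
predicted_class = np.argmax(class_probs)
print(predicted_class)             # 0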

LSTM/GRU TimeSeries multioutput strategy forecasts give dropped values

Currently, I'm playing with a stock-prediction task that I am trying to solve using LSTM/GRU.
Problem: after training the LSTM/GRU, the predicted values show a huge drop.
Model training process
Train and test data are generated simply using pd.shift in the series_to_supervised function.
df['Mid'] = (df['Low'] + df['High']) / 2  # mid price; parentheses so the sum is halved
n_lag = 1 # Lag columns back
n_seq = 1*50 # TimeSteps to predict
seq_col = 'Mid'
seq_col_t = f'{seq_col}(t)'
split_date = '2018-01-01'
def series_to_supervised(data: pd.DataFrame,
                         seq_col: str,
                         n_in: int = 1,
                         n_out: int = 1,
                         drop_seq_col: bool = True,
                         dropna: bool = True):
    """Convert a time series into a supervised learning problem
    {input sequence, forecast sequence}.
    """
    # input sequence (t-n, ... t-1) -> positive shift
    for i in range(n_in, 0, -1):
        data[f'{seq_col}(t-{i})'] = data[seq_col].shift(i)
    # current value (t) -> no shift
    data[f'{seq_col}(t)'] = data[seq_col]
    for i in range(1, n_out + 1):
        # forecast sequence (t+1, ... t+n) -> negative shift
        data[f'{seq_col}(t+{i})'] = data[seq_col].shift(-i)
    if drop_seq_col:
        data = data.drop(seq_col, axis=1)
    if dropna:
        data.dropna(inplace=True)
    return data
df = series_to_supervised(df, seq_col=seq_col, n_in=n_lag, n_out=n_seq)
mask = df.index < split_date
train, test = df[mask], df[~mask]
X_cols = ['Mid(t-1)']
y_cols = train.filter(like='Mid(t+').columns
X_train, y_train, X_test, y_test = train[X_cols], train[y_cols], test[X_cols], test[y_cols]
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(-1, 1))
# also returns np.ndarray
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)  # reuse the scaler fitted on the training data
y_train = y_train.values
y_test = y_test.values
from keras.models import Sequential
from keras.layers import Dense, Dropout, LSTM, GRU
from keras.optimizers import Adam, RMSprop, Adamax
from keras.callbacks import ModelCheckpoint
def get_model(X, y, n_batch):
    num_classes = y.shape[1]
    # design network
    model = Sequential()
    # for stock predictions the recurrent layer is used with stateful=True
    model.add(GRU(10, batch_input_shape=(n_batch, X.shape[1], X.shape[2]), stateful=True))
    model.add(Dropout(0.3))
    model.add(Dense(num_classes))
    opt = Adam(learning_rate=0.01)
    # opt = RMSprop(learning_rate=0.001)
    model.compile(loss='mean_squared_error', optimizer=opt)
    return model
def reshape_batch(X_train, y_train, X_test, y_test, n_batch):
    # reshape training data into [samples, timesteps, features]
    X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1])
    X_test = X_test.reshape(X_test.shape[0], 1, X_test.shape[1])
    # cut to a length evenly divisible by n_batch (no remainder),
    # needed for LSTM stateful=True
    train_cut = X_train.shape[0] % n_batch
    test_cut = X_test.shape[0] % n_batch
    if train_cut > 0:
        X_train = X_train[:-train_cut]
        y_train = y_train[:-train_cut]
    if test_cut > 0:
        X_test = X_test[:-test_cut]
        y_test = y_test[:-test_cut]
    return X_train, y_train, X_test, y_test
# fit an LSTM network to the training data
def fit_lstm(X_train: np.ndarray,
             y_train: np.ndarray,
             n_lag: int,
             n_seq: int,
             n_batch: int,
             nb_epoch: int,
             X_test: np.ndarray = None,
             y_test: np.ndarray = None):
    model = get_model(X_train, y_train, n_batch)
    # fit network
    history = model.fit(X_train, y_train, validation_data=(X_test, y_test), callbacks=None,
                        epochs=nb_epoch, batch_size=n_batch, verbose=1, shuffle=False)
    print('Predict:', model.predict(X_test, batch_size=n_batch))
    model.reset_states()
    return model, history
n_batch = 32
nb_epoch = 40
X_train, y_train, X_test, y_test = reshape_batch(X_train, y_train, X_test, y_test, n_batch)
model, history = fit_lstm(X_train, y_train, n_lag, n_seq, n_batch, nb_epoch, X_test=X_test, y_test=y_test)
What I have tried:
Different optimizers (pretty much all available in Keras)
Different recurrent network structures (GRU/LSTM)
Different learning rates
Different numbers of epochs, from 1 to 1500
Adding/removing Dropout layers with different rates (0.1-0.7)
Different numbers of LSTM/GRU neurons (1-100)
Different numbers of LSTM/GRU layers, via return_sequences, with more Dropout layers
Different numbers of forecast features (t+1, t+2 ... t+n), from 1 to 365
Different numbers of lag features (t-1, t-2 ... t-n), from 1 to 5
Different normalization ranges: (0, 1) and (-1, 1)
Different n_batch values: 1, 8, 16, 32
What could cause the LSTM/GRU to give such strange behaviour? And what else should I try to make it work normally?

Tensorflow Neural Network with more than 2 categories

So I watched a TensorFlow tutorial on Udemy and decided to try it for myself. He said that if you want more than 2 categories, change the activation to "softmax" and the units to 4; since I have 4 different categories it could be in, I changed the labels from 0:1 to 1:4. Everything works if there are only 2 different values in "y", but as soon as I change it to 4 units and 4 categories I get this error:
ValueError: Error when checking target: expected dense_3 to have shape (4,) but got array with shape (1,)
Even changing it back to shape "1" just results in a true/false category.
My dataset in y:
import numpy as np

dataset = np.load('/Users/alex/desktop/ANN_dataset_for_brightness.npy')
X = dataset[:, 0:17]
y = dataset[:, 17:19]

for i in range(27):
    if y[i] == 400:
        y[i] = 4
    elif y[i] == 300:
        y[i] = 3
    elif y[i] == 200:
        y[i] = 2
    elif y[i] == 100:
        y[i] = 1
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Importing the Keras libraries and packages
from keras.models import Sequential
from keras.layers import Dense
# Initialising the ANN
classifier = Sequential()
# Adding the input layer and the first hidden layer // input dim for input layer
classifier.add(Dense(activation="relu", input_dim=17, units=6, kernel_initializer="uniform"))
# Adding the second hidden layer
classifier.add(Dense(activation="relu", units=6, kernel_initializer="uniform"))
# Problem here
# units = categories, softmax = more than 2
classifier.add(Dense(activation="softmax", units=4, kernel_initializer="uniform"))
# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
# Fitting the ANN to the Training set
classifier.fit(X_train, y_train, batch_size = 27, nb_epoch = 100)
# Part 3 - Making the predictions and evaluating the model
# Predicting the Test set results
y_pred = classifier.predict(X_test)
y_pred = (y_pred > 0.5)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
With a multiclass task, your y_train and y_test have to be one-hot encoded. That means they must have shape (number_of_samples, 4), where 4 is the number of classes.
You need to apply tensorflow.keras.utils.to_categorical to them before the train-test split.
y = to_categorical(y, 4)
ref: https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical
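A minimal sketch of the resulting shape (note that to_categorical expects class indices in 0..num_classes-1, so the 1..4 labels built in the question would need to be shifted to 0..3, or num_classes set to 5):

import numpy as np
from tensorflow.keras.utils import to_categorical

y = np.array([4, 3, 2, 1]) - 1             # remap the 1..4 labels to 0..3
y_onehot = to_categorical(y, num_classes=4)
print(y_onehot.shape)                      # (4, 4): one row per sample, one column per class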

Neural network having suspiciously bad results

I am playing around with the Keras library, trying to predict a time series, and getting very bad results; I would like to know why the neural network can't handle even a simple scenario. My (engineered) data look like this:
(The pattern is very simple: result has exactly the same value as feature, and there are 10000 lines like this.)
dataPointIndex,feature,result
0, 1, 1
1, 1, 1
2, 0, 0
3, 1, 1
4, 1, 1
5, 1, 1
6, 1, 1
7, 0, 0
8, 1, 1
9, 0, 0
10, 1, 1
...
My Keras code:
TIMESERIES_LENGTH = 10
TIMESERIES_SAMPLING_RATE = 1
TIMESERIES_BATCH_SIZE = 16
TEST_SET_RATIO = 0.2
VALIDATION_SET_RATIO = 0.2
data = pd.read_csv("data/" + csv_path)
x = data.ix[:, 1:2]
y = data.ix[:, 2]
test_set_length = int(round(len(x) * TEST_SET_RATIO))
validation_set_length = int(round(len(x) * VALIDATION_SET_RATIO))
x_train_and_val = x[:-test_set_length]
y_train_and_val = y[:-test_set_length]
x_train = x_train_and_val[:-validation_set_length].values
y_train = y_train_and_val[:-validation_set_length].values
x_val = x_train_and_val[-validation_set_length:].values
y_val = y_train_and_val[-validation_set_length:].values
x_test = x[-test_set_length:].values
y_test = y[-test_set_length:].values
scaler = sklearn.preprocessing.StandardScaler().fit(x_train_and_val)
train_gen = keras.preprocessing.sequence.TimeseriesGenerator(
    x_train,
    y_train,
    length=TIMESERIES_LENGTH,
    sampling_rate=TIMESERIES_SAMPLING_RATE,
    batch_size=TIMESERIES_BATCH_SIZE
)
val_gen = keras.preprocessing.sequence.TimeseriesGenerator(
    x_val,
    y_val,
    length=TIMESERIES_LENGTH,
    sampling_rate=TIMESERIES_SAMPLING_RATE,
    batch_size=TIMESERIES_BATCH_SIZE
)
test_gen = keras.preprocessing.sequence.TimeseriesGenerator(
    x_test,
    y_test,
    length=TIMESERIES_LENGTH,
    sampling_rate=TIMESERIES_SAMPLING_RATE,
    batch_size=TIMESERIES_BATCH_SIZE
)
model = keras.models.Sequential()
model.add(keras.layers.Dense(100, activation='relu', input_shape=(TIMESERIES_LENGTH, 1)))
model.add(keras.layers.Flatten())
model.add(keras.layers.Dense(1000, activation='relu'))
model.add(keras.layers.Dense(1, activation='sigmoid'))
model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['accuracy']
)
history = model.fit_generator(
    train_gen,
    epochs=20,
    verbose=1,
    validation_data=val_gen
)
plt.plot(history.history['acc'])
plt.plot(history.history['val_acc'])
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.ylabel('accuracy/loss')
plt.xlabel('epoch')
plt.legend(['training accuracy', 'validation accuracy', 'training loss', 'validation loss'], loc='upper left')
plt.show()
Results: (accuracy/loss plot not included here)
I have tried LSTM layers, but they perform similarly badly.
Any idea what I am doing wrong? Thank you very much.
It turns out keras.preprocessing.sequence.TimeseriesGenerator expects the targets (y_train in my example) to be shifted by one compared to X (x_train in my case).
Your input data should be shaped so that the subsequence of X ending at index n predicts the value at index n + 1 in your y. My original mistake was that it predicted the value at index n.
Thanks to Daniel Möller for pointing me in the right direction.
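A minimal sketch of that behaviour with a toy series (not the stock data from the question): each window of length 3 is paired with the target one step after the window ends.

import numpy as np
from keras.preprocessing.sequence import TimeseriesGenerator

x = np.arange(10).reshape(-1, 1)   # toy feature series 0..9
y = np.arange(10)                  # toy target series 0..9

gen = TimeseriesGenerator(x, y, length=3, batch_size=1)
batch_x, batch_y = gen[0]
print(batch_x.ravel(), batch_y)    # [0 1 2] [3] -> the window ending at index 2 predicts y[3]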
What is the mean value of the target data? Is it zero? In my experience, the default configuration of a neural network does not include a constant offset; you can get one by giving the last layer an affine or linear activation function.

Good accuracy on validation and test but bad predictions keras lstm

I'm having trouble with LSTM and Keras.
I try to predict normal/fake domain names.
My dataset is like this:
domain,fake
google, 0
bezqcuoqzcjloc,1
...
with 50% normal and 50% fake domains
Here's my LSTM model:
def build_model(max_features, maxlen):
    """Build LSTM model"""
    model = Sequential()
    model.add(Embedding(max_features, 128, input_length=maxlen))
    model.add(LSTM(64))
    model.add(Dropout(0.5))
    model.add(Dense(1))
    model.add(Activation('sigmoid'))
    sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True)
    model.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['acc'])
    return model
Then I preprocess my text data to transform it into numbers:
"""Run train/test on logistic regression model"""
indata = data.get_data()
# Extract data and labels
X = [x[1] for x in indata]
labels = [x[0] for x in indata]
# Generate a dictionary of valid characters
valid_chars = {x:idx+1 for idx, x in enumerate(set(''.join(X)))}
max_features = len(valid_chars) + 1
maxlen = 100
# Convert characters to int and pad
X = [[valid_chars[y] for y in x] for x in X]
X = sequence.pad_sequences(X, maxlen=maxlen)
# Convert labels to 0-1
y = [0 if x == 'benign' else 1 for x in labels]
Then I split my data into training, testing and validation sets:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
print("Build model...")
model = build_model(max_features, maxlen)
print("Train...")
X_train, X_holdout, y_train, y_holdout = train_test_split(X_train, y_train, test_size=0.2)
And then I train my model on training data and validation data, and evaluate on test data:
history = model.fit(X_train, y_train, epochs=max_epoch, validation_data=(X_holdout, y_holdout), shuffle=False)
scores = model.evaluate(X_test, y_test, batch_size=batch_size)
At the end of my training/testing I have these results (training plots not included here), and these scores when evaluating on the test dataset:
loss = 0.060554939906234596
accuracy = 0.978109902033532
However when I predict on a sample of the dataset like this:
LSTM_model = load_model('LSTMmodel_64_sgd.h5')
data = pickle.load(open('traindata.pkl', 'rb'))
#### LSTM ####
"""Run train/test on logistic regression model"""
# Extract data and labels
X = [x[1] for x in data]
labels = [x[0] for x in data]
X1, _, labels1, _ = train_test_split(X, labels, test_size=0.9)
# Generate a dictionary of valid characters
valid_chars = {x:idx+1 for idx, x in enumerate(set(''.join(X1)))}
max_features = len(valid_chars) + 1
maxlen = 100
# Convert characters to int and pad
X1 = [[valid_chars[y] for y in x] for x in X1]
X1 = sequence.pad_sequences(X1, maxlen=maxlen)
# Convert labels to 0-1
y = [0 if x == 'benign' else 1 for x in labels1]
y_pred = LSTM_model.predict(X1)
I have very poor performance:
accuracy = 0.5934741842730341
confusion matrix = [[25201 14929]
                    [17589 22271]]
F1-score = 0.5780171295094731
Can someone explain to me why?
I have tried 64 instead of 128 units for the LSTM, adam and rmsprop as optimizers, and increasing the batch_size; however, performance remains very low.
Ok, so I have found the answer.
It is this line:
valid_chars = {x:idx+1 for idx, x in enumerate(set(''.join(X1)))}
In Python 3, set seems to produce a different ordering every time a new Python 3 console is opened, so the character-to-integer mapping rebuilt at prediction time does not match the one used for training.
So running the code in Python 2 resolved my issue!
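As an aside (not part of the original fix), one way to stay on Python 3 is to make the character mapping deterministic, e.g. by sorting the character set, and to reuse the mapping built at training time instead of rebuilding it at prediction time:

# hypothetical adjustment, assuming X is the list of domain strings used at training time
valid_chars = {c: idx + 1 for idx, c in enumerate(sorted(set(''.join(X))))}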
