How to fix class imbalance in dialogue (text) time series data?

I have a dataset that looks like this:
df.head(5)
data labels
0 [0.0009808844009380855, 0.0008974465127279559] 1
1 [0.0007158940267629654, 0.0008202958833774329] 3
2 [0.00040971929722210984, 0.000393972522972382] 3
3 [7.916243163372941e-05, 7.401835468434177e-05] 3
4 [8.447556379936086e-05, 8.600626393842705e-05] 3
The 'data' column is my X and the 'labels' column is my y. The df has 34890 rows, and each row contains 2 floats. The data represents sequential text: each observation is a representation of one sentence. There are 5 classes.
I am training it on this LSTM code:
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras import layers
from tensorflow.keras.models import Sequential

data = df.data.values
labels = pd.get_dummies(df['labels']).values
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.10, random_state=42)
X_train = X_train.reshape((X_train.shape[0], 1, X_train.shape[1]))  # shape = (31401, 1, 5)
X_test = X_test.reshape((X_test.shape[0], 1, X_test.shape[1]))  # shape = (3489, 1, 5)
### y_train shape = (31401, 5)
### y_test shape = (3489, 5)

### Bi_LSTM
Bi_LSTM = Sequential()
Bi_LSTM.add(layers.Bidirectional(layers.LSTM(32)))
Bi_LSTM.add(layers.Dropout(.5))
# Bi_LSTM.add(layers.Flatten())
Bi_LSTM.add(layers.Dense(5, activation='softmax'))  # one output unit per class (5 classes)

def compile_and_fit(model):
    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(X_train,
                        y_train,
                        epochs=30,
                        batch_size=32,
                        validation_data=(X_test, y_test))
    return history

LSTM_history = compile_and_fit(Bi_LSTM)
The model trains, but the val accuracy is fixed at 53% for every epoch. I am assuming this is because of my class imbalance problem (1 class takes up ~53% of the data, the other 4 are somewhat evenly distributed throughout the remaining 47%).
How do I balance my data? I am aware of typical over/under sampling techniques on non-time series data, but I can't over/under sample because that would mess with the sequential time-series nature of the data. Any advice?
EDIT
I am attempting to use the class_weight argument in Keras to address this. I am passing this dict as the class_weight argument:
class_weights = {
    0: 1/len(df[df.labels == 1]),
    1: 1/len(df[df.labels == 2]),
    2: 1/len(df[df.labels == 3]),
    3: 1/len(df[df.labels == 4]),
    4: 1/len(df[df.labels == 5]),
}
Which I am basing off of this recommendation:
https://stats.stackexchange.com/questions/342170/how-to-train-an-lstm-when-the-sequence-has-imbalanced-classes
However, the acc/loss is now really awful. I get ~30% accuracy with a dense net, so I expected the LSTM to be an improvement. See acc/loss curves below:

Keras/TensorFlow lets you pass class_weight or sample_weight to the model.fit method.
class_weight affects the relative weight of each class in the calculation of the objective function. sample_weight, as the name suggests, allows finer control over the relative weight of individual samples, including samples that belong to the same class.
class_weight accepts a dictionary that maps each class index to its weight, while sample_weight receives a one-dimensional array of length len(y_train) in which you assign a specific weight to each sample.
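As a rough sketch of how the two arguments relate, the snippet below builds inverse-frequency class weights and the equivalent per-sample weight array with NumPy. The toy label array `y_int` and the `n_samples / (n_classes * count)` heuristic are assumptions chosen for illustration, not something taken from the question. Note that the `class_weight` keys should be the 0-based column indices of the one-hot targets, not the original label values.

```python
import numpy as np

# Hypothetical integer labels (0..4) with one dominant class, standing in for
# the argmax of the one-hot y_train from the question.
y_int = np.array([2] * 53 + [0] * 12 + [1] * 12 + [3] * 12 + [4] * 11)

classes, counts = np.unique(y_int, return_counts=True)

# "Balanced" heuristic: n_samples / (n_classes * count_of_class).
# Rare classes get weights above 1, the dominant class a weight below 1.
weights = len(y_int) / (len(classes) * counts)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}

# Equivalent per-sample weights: one entry per training example.
sample_weight = np.array([class_weight[int(c)] for c in y_int])

print(class_weight)

# Either of these could then be passed to Keras (use one, not both):
# model.fit(X_train, y_train, class_weight=class_weight, ...)
# model.fit(X_train, y_train, sample_weight=sample_weight, ...)
```

With this heuristic the weights average out to 1 over the training set, so the overall loss scale stays roughly the same. Weights on the order of 1/count, as in the question's dict, shrink the loss by several orders of magnitude, which may be part of why training degraded.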

Related

Passing dataframe to keras sequential model

I'm trying to build and train a simple MLP model using keras.Sequential().
However, I'm having issues when, after each training epoch, I try to evaluate the current status of the model on the train and test data.
I'm having this problem on a couple different datasets, one of them is the "CAR DETAILS FROM CAR DEKHO" dataset, you can find it here
This is what I'm doing so far:
import numpy as np
import tensorflow as tf
import pandas as pd
from sklearn.model_selection import train_test_split

def main():
    ## read, preprocess and split data
    df_data = pd.read_csv('car_data_CAR_DEKHO.csv')
    df_data = pre_process(df_data)
    X_train, y_train, X_test, y_test = split_data(df_data) ## -> these are PANDAS DATAFRAMES!
    train(X_train, X_test, y_train, y_test)

def train(X_train, X_test, y_train, y_test):
    ##--------------------
    ## building model
    ##--------------------
    batch = 5000
    epochs = 500
    lr = 0.001
    data_iter = load_array((X_train, y_train), batch)
    initializer = tf.initializers.RandomNormal(stddev=0.01)
    net = tf.keras.Sequential()
    net.add(tf.keras.layers.Dense(1, kernel_initializer=initializer))
    loss = tf.keras.losses.MeanSquaredError()
    trainer = tf.keras.optimizers.SGD(learning_rate=lr)
    ##--------------#
    ## training #
    ##--------------#
    for epoch in range(1, epochs + 1):
        for X_batch, Y_batch in data_iter:
            with tf.GradientTape() as tape:
                l = loss(net(X_batch, training=True), Y_batch)
            grads = tape.gradient(l, net.trainable_variables)
            trainer.apply_gradients(zip(grads, net.trainable_variables))
        # test on train set after epoch
        y_pred_train = net(X_train) ## ERROR HERE!!!
        l_train = loss(y_pred_train, y_train)
        y_pred_test = net(X_test)
        l_test = loss(y_pred_test, y_test)

def load_array(data_arrays, batch_size, is_train=True):
    """Construct a TensorFlow data iterator."""
    dataset = tf.data.Dataset.from_tensor_slices(data_arrays)
    if is_train:
        dataset = dataset.shuffle(buffer_size=1000)
    dataset = dataset.batch(batch_size)
    return dataset

def split_data(df_data):
    X = df_data.copy()
    y = X.pop('selling_price')
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    return X_train, y_train, X_test, y_test

def pre_process(df_data):
    ## check NaNs and drop rows if any
    print(df_data.isnull().sum())
    df_data.dropna(inplace=True)
    ## drop weird outlier, turns out it has 1 km_driven
    df_data.drop([1312], inplace=True)
    ## features engineering
    df_data['name'] = df_data['name'].map(lambda x: x.split(' ')[0])
    df_data['owner'] = df_data['owner'].map(lambda x: x.split(' ')[0])
    df_data['selling_price'] = df_data['selling_price']/1000
    df_data_dummies = pd.get_dummies(df_data, drop_first=True)
    df_data_dummies = normalize(df_data_dummies) ## this is a simple min-max scaling, I do it manually but you can use sklearn or something similar
    return df_data_dummies

def normalize(df):
    print('Data normalization:')
    result = df.copy()
    for feature_name in df.columns:
        if feature_name == 'selling_price':  # leave the target column unscaled
            pass
        else:
            max_value = df[feature_name].max()
            min_value = df[feature_name].min()
            result[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
            if result[feature_name].isnull().values.any():
                result.drop([feature_name], axis=1, inplace=True)
                print(f'Something wrong in {feature_name}, dropped.')
                print(f'now shape is {len(result)}, {len(result.columns)}')
    print(f'\treturning {len(result)}, {len(result.columns)}')
    return result
and I'm getting the error:
File "/home/lews/anaconda3/envs/tf/lib/python3.8/site-packages/tensorflow/python/keras/engine/input_spec.py", line 232, in assert_input_compatibility
ndim = x.shape.rank
AttributeError: 'tuple' object has no attribute 'rank'
I guess the error is due to me passing X_train (which is a dataframe) directly to net.
I also tried using again:
y_pred_train = net(tf.data.Dataset.from_tensor_slices(X_train))
like when creating training batches, but it returns another error:
File "/home/lews/anaconda3/envs/tf/lib/python3.8/site-packages/tensorflow/python/keras/engine/input_spec.py", line 201, in assert_input_compatibility
raise TypeError('Inputs to a layer should be tensors. Got: %s' % (x,))
TypeError: Inputs to a layer should be tensors. Got: <TensorSliceDataset shapes: (19,), types: tf.float64>
Finally, I tried using:
y_pred_train = net.predict(X_train)
the weird thing in this case is that I got an OOM error, referring to a tensor with shape[76571,76571]:
File "<string>", line 3, in raise_from
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[76571,76571] and type double on /job:localhost/replica:0/task:0/device:CPU:0 by allocator cpu [Op:SquaredDifference]
but the X_train dataframe has shape (76571, 19), so I don't understand what is happening.
What is the correct way to do this?

Your code mostly looks OK, so the issue must be with the data that you pass.
Check the content and datatypes of the data that you feed.
Try converting the pandas slices into np.arrays, re-check their dimensions, and then feed the np.arrays to load_array().
Also try smaller batches, like 64 (not 5000).
UPDATE:
Apparently, when you pass X_batch to the model you pass a tf.Tensor, but later, when you pass the whole X_train or X_test, you pass pd.DataFrames and the model gets confused.
You should change just 2 lines:
y_pred_train = net(tf.constant(X_train)) # pass TF.tensor - best
#alternative:
y_pred_train = net(X_train.values) # pass np.array - also good
y_pred_test = net(tf.constant(X_test)) # make similar change here
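The `[76571, 76571]` tensor in the OOM message is consistent with a broadcasting mismatch: the model output has shape `(N, 1)` while `y_train` has shape `(N,)`, so an element-wise difference expands to `(N, N)`. The NumPy sketch below reproduces the effect with small made-up arrays. This is an interpretation of the error message, not something confirmed in the thread.

```python
import numpy as np

n = 4
y_pred = np.arange(n, dtype=float).reshape(n, 1)  # model output: shape (n, 1)
y_true = np.arange(n, dtype=float)                # target values: shape (n,)

# (n, 1) - (n,) broadcasts to (n, n): every prediction against every target.
bad = (y_pred - y_true) ** 2
print(bad.shape)   # (4, 4)

# Giving the targets an explicit trailing axis keeps the comparison row-wise.
good = (y_pred - y_true.reshape(-1, 1)) ** 2
print(good.shape)  # (4, 1)
```

In the question's loop, the counterpart would be something like passing `y_train.values.reshape(-1, 1)` to the loss, under the assumption that `y_train` is a one-dimensional pandas Series.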

The issue looks like it is related to the data (as Poe Dator says). What I believe is going on is that your network has some input shape based on the batches of data it is receiving. Then when you are trying to predict or call your network on the data (this also recomputes shapes since it calls the build() function), it tries to get the data into the shape it expects. I think specifically it is expecting a shape of (batch, 1, 19) but with your data in (76571, 19) it is not finding the correct shape.
A couple of easy steps to work on this would be:
Call net.summary() to see what the shapes it believes it is getting before and after training
Provide the input shape to the first layer, net.add(tf.keras.layers.Dense(1, kernel_initializer=initializer, input_shape=(1, 19)))
Slice your X data in the same shape as your training data.
Add a dimension to your data so it is (76571, 1, 19) to explicitly shape it as well.
Also, as noted above, smaller batch sizes would be best. I would also recommend using the model.fit() method instead of handling gradients yourself if you don't have a lot of experience with TensorFlow. This saves you code and makes it easier to ensure you are handling your model correctly during training.

Binary time series forecasting with LSTM in python

Hello I am working with binary time series of expression data as follows:
0: decrease expression
1: increase expression
I am training a Bidirectional LSTM network to predict the next value, but instead of giving me values of 0 or 1, it returns values like:
0.564
0.456
0.423
0.58
How can I get it to return 0 or 1?
this is my code:
ventana = 10
n_features = 1
neurons = 256 #155
activacion = 'softmax'
perdida = 0.25
batch_size = 32 # 32
epochs = 100 # 200
X, y = split_sequence(cierres, ventana)
X = X.reshape((X.shape[0], X.shape[1], n_features))
# define model
model = Sequential()
model.add(Bidirectional(LSTM(neurons, activation=activacion), input_shape=(ventana, n_features)))
model.add(Dropout(perdida))
model.add(Dense(1))
model.compile(optimizer='adam', loss='binary_crossentropy')
# fit model
model.fit(X, y, epochs=epochs, batch_size=batch_size, verbose=1, shuffle=False)

The network is effectively performing a regression on the data, so it doesn't give an exact 0 or 1. By giving a number in between, it is producing something like a degree of confidence, with numbers closer to 1 being more confidently a 1. To transform this, you can apply thresholding, where you round the output to 0 or 1.
import numpy as np
y_out = model.predict(...)  # predict() returns the outputs; fit() returns a History object
y_pred = np.round(y_out)
That being said, this doesn't actually minimize some kinds of loss functions. If you are being scored on a function like MSE, it is better to keep the numbers as they are.
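A minimal sketch of the thresholding step, assuming the final layer produces a probability (for example `Dense(1, activation='sigmoid')`, which the question's model does not have). The logits below are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    """Map raw scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up raw model outputs (logits) for five time steps.
logits = np.array([0.26, -0.18, -0.31, 0.32, 0.0])
probs = sigmoid(logits)

# Threshold explicitly; np.round rounds exact halves to the nearest even
# integer, so np.round(0.5) is 0.0.
threshold = 0.5
labels = (probs >= threshold).astype(int)
print(labels)  # [1 0 0 1 1]
```

An explicit comparison also makes it easy to move the threshold away from 0.5 if one class matters more than the other.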

LSTM time series forecasting - starting with low loss, and accuracy does not change

I'm trying to predict network traffic based on past values. I built an LSTM network and tried several parameters; however, I always end up with the same very low accuracy (0.108).
scaler = MinMaxScaler(feature_range = (0, 1))
dataset = scaler.fit_transform(dataset)
train_size = int(len(dataset) * 0.67)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]
print(len(train), len(test))
def create_dataset(dataset, window_size = 1):
    data_X, data_Y = [], []
    for i in range(len(dataset) - window_size - 1):
        a = dataset[i:(i + window_size), 0]
        data_X.append(a)
        data_Y.append(dataset[i + window_size, 0])
    return(np.array(data_X), np.array(data_Y))
window_size = 1
train_X, train_Y = create_dataset(train, window_size)
test_X, test_Y = create_dataset(test, window_size)
print("Original training data shape:")
print(train_X.shape)
# Reshape the input data into appropriate form for Keras.
train_X = np.reshape(train_X, (train_X.shape[0], 1, train_X.shape[1]))
test_X = np.reshape(test_X, (test_X.shape[0], 1, test_X.shape[1]))
model = Sequential()
model.add(LSTM(4, input_shape = (1, window_size)))
model.add(Dense(1))
opt = optimizers.SGD(lr=0.01, momentum=0.9)
model.compile(loss = "mean_squared_error", optimizer = opt, metrics = ['accuracy'])
As you can see, my loss starts from quite a low value, and my accuracy is constant over time. What am I doing wrong?
Thanks in advance. :)
You can find the loss and accuracy graph here:
loss
accuracy

Try to randomly shuffle the data.
LSTMs are best used with sequential data, so either replace the LSTM with a Dense layer or change your inputs. You need to pass a sequence of past values to the LSTM so that it can predict the next value. An input of shape (1, 1) is not a sequence, so the LSTM is not useful here.
The accuracy metric is meaningless in this context; try mean absolute error or a similar regression metric.
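The two suggestions above (longer input windows and a regression metric) can be sketched with NumPy alone. The series values and `window_size=3` are arbitrary assumptions. In Keras, the counterpart of the metric change would be compiling with `metrics=['mae']` instead of `metrics=['accuracy']`.

```python
import numpy as np

def make_windows(series, window_size):
    """Return (X, y): X[i] holds window_size past values, y[i] the next one."""
    X = np.array([series[i:i + window_size]
                  for i in range(len(series) - window_size)])
    y = series[window_size:]
    return X, y

# Made-up scaled traffic values.
series = np.linspace(0.0, 1.0, 11)
X, y = make_windows(series, window_size=3)

# Keras LSTMs expect (samples, timesteps, features).
X = X.reshape(X.shape[0], X.shape[1], 1)
print(X.shape, y.shape)  # (8, 3, 1) (8,)

# Mean absolute error of a naive "repeat the last value" forecast.
naive_pred = X[:, -1, 0]
mae = np.mean(np.abs(naive_pred - y))
print(round(float(mae), 3))  # 0.1
```

Here the time axis has length 3, so the LSTM would actually see a sequence, unlike the `(samples, 1, 1)` input in the question.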

If (13942, 1, 1) is your entire dataset, it's far too small for deep learning; you're better off using 'shallow' methods, e.g. Support Vector Machines (SVM). Alternatively, consider my answer here.
EDIT: I just noticed that you use the accuracy metric; accuracy for regression is undefined - I'm surprised an error wasn't thrown. If it uses prediction==true to compute accuracy, you're lucky your accuracy isn't 0. Based on the loss, your model appears to be actually doing quite well; to double-check, plot predictions vs true and compare. (In general, mse < .3 is good, and mse < .1 is excellent)

I wonder if I was right about the implementation of the LSTM layer using Keras

Here is my model definition:
model = Sequential()
model.add(LSTM(i, input_shape=(None, 1), return_sequences=True))
model.add(Dropout(l))
model.add(LSTM(j))
model.add(Dropout(l))
model.add(Dense(k))
model.add(Dropout(l))
model.add(Dense(1))
and here is result
p = model.predict(x_test)
plt.plot(y_test)
plt.plot(p)
The sequential input represents the past signal in previous time-steps, and the output predicts the signal at the next time-step. After splitting the data into training and testing sets, the predictions on the test data are as follows:
The figure shows an almost perfect match between the gold test data and the predictions. Is it possible to predict with such high accuracy?
I think something is wrong because there's no volatility, so I wonder if it's been implemented properly.
If the implementation is correct, how can I get the following (next) value?
Is it right to implement it like this?
a = x_test[-1]
b = model.predict(a)
c = model.predict(b)
...
To sum up the question:
Is the implementation correct?
How do I get the value of the next data point?
def create_dataset(signal_data, look_back=1):
    dataX, dataY = [], []
    for i in range(len(signal_data) - look_back):
        dataX.append(signal_data[i:(i + look_back), 0])
        dataY.append(signal_data[i + look_back, 0])
    return np.array(dataX), np.array(dataY)
train_size = int(len(signal_data) * 0.80)
test_size = len(signal_data) - train_size - int(len(signal_data) * 0.05)
val_size = len(signal_data) - train_size - test_size
train = signal_data[0:train_size]
val = signal_data[train_size:train_size+val_size]
test = signal_data[train_size+val_size:len(signal_data)]
x_train, y_train = create_dataset(train, look_back)
x_val, y_val = create_dataset(val, look_back)
x_test, y_test = create_dataset(test, look_back)
I use create_dataset with look_back=20.
signal_data is preprocessed with min-max normalisation MinMaxScaler(feature_range=(0, 1)).

Is the implementation correct?
Your code seems correct, and I don't think your results are surprising. You need to compare the results with a baseline in which the next prediction is randomly sampled from the range of possible day-to-day changes. That way you can at least tell whether your model is doing better than random sampling.
delta_train = train[1][1:] - train[1][:-1]
delta_range_train = delta_train.max()-delta_train.min()
# generating the baseline based on the change range in training:
random_p = test[0][:, -1] + (np.random.rand(test[0].shape[0])-0.5)*delta_range_train
You can check if your results are better than just a random sample random_p.
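A self-contained sketch of that comparison follows. It uses a synthetic random-walk signal (an assumption, not the asker's data) and adds a persistence baseline, which is not part of the answer but is a common companion check: predict that the next value equals the last observed one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic smooth signal standing in for the scaled series.
signal = np.cumsum(rng.normal(0.0, 0.01, size=500)) + 0.5
last_seen, target = signal[:-1], signal[1:]

def mse(pred, true):
    """Mean squared error between two equally shaped arrays."""
    return float(np.mean((pred - true) ** 2))

# Persistence baseline: predict that the next value equals the last one.
persistence_mse = mse(last_seen, target)

# Random baseline in the spirit of the answer: last value plus a uniform
# step drawn from the observed range of one-step changes.
delta = np.diff(signal)
delta_range = delta.max() - delta.min()
random_pred = last_seen + (rng.random(last_seen.shape[0]) - 0.5) * delta_range
random_mse = mse(random_pred, target)

print(persistence_mse, random_mse)
```

A model whose test MSE is not clearly below the persistence MSE may simply be echoing its last input, which can still look like a near-perfect match when plotted.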
How do I get the value of the next data point?
This gives you the last data point in the test set:
a = x_test[-1:]
Then, here you are predicting the next point:
b = model.predict(a)
Depending on look_back, you may need to keep some of the data points from a to predict the point after next:
c = model.predict(np.array([list(a[0,1:])+[b[0]]]))  # b has shape (1, 1); b[0] matches the per-step shape
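The same idea can be written as a loop that slides the window forward one step per prediction. In this sketch `predict_fn` is a hypothetical stand-in for `model.predict`, and the window is assumed to have shape `(1, look_back, 1)`, matching the `input_shape=(None, 1)` in the question.

```python
import numpy as np

def rolling_forecast(predict_fn, last_window, n_steps):
    """Feed each prediction back in to forecast n_steps values ahead.

    last_window has shape (1, look_back, 1); predict_fn maps such a window
    to an array of shape (1, 1), like model.predict in the question.
    """
    window = last_window.copy()
    forecasts = []
    for _ in range(n_steps):
        next_value = predict_fn(window)              # shape (1, 1)
        forecasts.append(float(next_value[0, 0]))
        # Drop the oldest step and append the new prediction.
        window = np.concatenate(
            [window[:, 1:, :], next_value.reshape(1, 1, 1)], axis=1)
    return np.array(forecasts)

# Stand-in predictor (hypothetical): the mean of the window.
def mean_predictor(window):
    return window.mean(axis=1)  # (1, look_back, 1) -> (1, 1)

last_window = np.array([1.0, 2.0, 3.0]).reshape(1, 3, 1)
forecast = rolling_forecast(mean_predictor, last_window, n_steps=2)
print(forecast)  # first 2.0, then 7/3
```

Errors tend to accumulate with this approach, because every step after the first is predicted from earlier predictions rather than from observed values.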

a simple single layer perceptron implementation for binary classification with sigmoid activation function

Hello, I'm trying to complete an assignment that involves training a perceptron (without any hidden layer) to perform binary classification using a sigmoid activation function, but for some reason my code is not working correctly. Although the error decreases after each epoch, the accuracy does not increase. My target labels are 1 and 0, but my predicted labels are almost all close to one; none of my predicted labels represents the 0 class.
Below is my code. Can anyone please tell me what I have done wrong?
# Create a Neural_Network class
class Neural_Network(object):
    def __init__(self, inputSize = 2, outputSize = 1):
        # size of layers
        self.inputSize = inputSize
        self.outputSize = outputSize
        # weights
        self.W1 = 0.01*np.random.randn(inputSize+1, outputSize) # randomly initialize W1 using random function of numpy
        # size of the weight will be (inputSize+1, outputSize); the +1 is for the bias

    def feedforward(self, X): # forward propagation through our network
        n,m=X.shape
        Xbias = np.ones((n,1)) # bias term in input
        Xnew = np.hstack((Xbias,X)) # adding bias term to input to match the dimension of the weight
        self.product=np.dot(Xnew,self.W1) # dot product of X (input) and set of weights
        output=self.sigmoid(self.product) # apply activation function (i.e. sigmoid)
        return output # return the final output of the network

    def sigmoid(self, s): # apply sigmoid function on s and return its value
        return (1./(1. + np.exp(-s))) # activation sigmoid function

    def sigmoid_derivative(self, s): # derivative of sigmoid
        # derivative of sigmoid = sigmoid(x)*(1-sigmoid(x))
        return s*(1-s) # here s will be sigmoid(x)

    def backwardpropagate(self, X, Y, y_pred, lr):
        # backward propagate through the network
        # compute error in output which is loss, compute cross entropy loss function
        self.output_error=self.crossentropy(Y,y_pred) # output error
        # applying derivative of sigmoid to the error
        self.error_deriv=self.output_error*self.sigmoid_derivative(y_pred)
        # adjust set of weights
        n,m=X.shape
        Xbias = np.ones((n,1)) # bias term in input
        Xnew = np.hstack((Xbias,X)) # adding bias term to input to match the dimension of the weight
        self.W1 += lr*(Xnew.T.dot(self.error_deriv)) # W1=W1+ learningrate*errorderiv*input
        #self.W1 += X.T.dot(self.z2_delta)

    def crossentropy(self, Y, Y_pred):
        # compute error based on crossentropy loss
        # Cross entropy = sum(Y_actual*log(y_predicted))/N. here 1e-6 is used to avoid log 0
        N = Y_pred.shape[0]
        #cr_entropy=-np.sum(((Y*np.log(Y_pred+1e-6))+((1-Y)*np.log(1-Y_pred+1e-6))))/N
        cr_entropy=-np.sum(Y*np.log(Y_pred+1e-6))/N
        return cr_entropy # error

    Null=None
    def train(self, trainX, trainY, epochs = 100, learningRate = 0.001, plot_err = True, validationX = Null, validationY = Null):
        tr_error=[]
        for i in range(epochs):
            # feed forward trainX and trainY and receive predicted value
            y_predicted=self.feedforward(trainX)
            print(i,y_predicted)
            # backpropagation with trainX, trainY, predicted value and learning rate.
            self.backwardpropagate(trainX,trainY,y_predicted,learningRate)
            tr_error.append(self.output_error)
            print(i,self.output_error)
            print(i,self.W1)
        # if validationX and validationY are not null then show validation accuracy and error of the model.
        # plot error of the model if plot_err is true
        epocharray=range(0,epochs)
        plt.plot(epocharray,tr_error,'r',linewidth=3.0) # plotting error vs. no. of epochs
        plt.xlabel('No. of Epochs')
        plt.ylabel('Cross Entropy Error')
        plt.title('Error Vs. Epoch')

    def predict(self, testX):
        # predict the value of testX
        self.ytest_pred=self.feedforward(testX)

    def accuracy(self, testX, testY):
        import math
        # predict the value of testX
        self.ytest_pred1=self.feedforward(testX)
        acc=0
        # compare it with testY
        for j in range(len(testY)):
            q=math.ceil(self.ytest_pred1[j])
            #p=round(q)
            if testY[j] == q:
                acc +=1
        accuracy=acc/float(len(testX))*100
        print("Percentage Accuracy is", accuracy,"%")
        # compute accuracy, print it and show in the form of picture
        return accuracy # return accuracy
# generating dataset point
np.random.seed(1)
no_of_samples = 2000
dims = 2
#Generating random points of values between 0 to 1
class1=np.random.rand(no_of_samples,dims)
#To add separability we will add a bias of 1.1
class2=np.random.rand(no_of_samples,dims)+1.1
class_1_label=np.array([1 for n in range(no_of_samples)])
class_2_label=np.array([0 for n in range(no_of_samples)])
#Lets visualize the dataset
plt.scatter(class1[:,0],class1[:,1], marker='^', label="class 1")
plt.scatter(class2[:,0],class2[:,1], marker='o', label="class 2")
plt.xlabel('Dimension 1')
plt.ylabel('Dimension 2')
plt.legend(loc='best')
plt.show()
from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split
# Data concatenation
data = np.concatenate((class1,class2),axis=0)
label = np.concatenate((class_1_label,class_2_label),axis=0)
#Note: shuffle this dataset before dividing it into three parts
data,label=shuffle(data,label)
#print(data)
# now using train_test_split command to split data into 60% training data, 20% testing data and 20% validation data
trainX, testX, trainY, testY = train_test_split(data, label, test_size=0.2, random_state=1)
trainX, validX, trainY, validY = train_test_split(trainX, trainY, test_size=0.25, random_state=1)
model = Neural_Network(2,1)
# try different combinations of epochs and learning rate
model.train(trainX, trainY, epochs = 100, learningRate = 0.000001, validationX = validX, validationY = validY)
model.accuracy( testX,testY)
The results come out like this (no label goes near 0):
0 [[0.49670809]
[0.4958389 ]
[0.4966064 ]
...
[0.49537492]
[0.49566927]
[0.4961255 ]]
0 828.1069658303942
0 [[0.48311074]
[0.51907406]
[0.52764299]]
1 [[0.69813116]
[0.91746189]
[0.80408611]
...
[0.74821077]
[0.87150079]
[0.75187736]]
1 250.96538025031356
1 [[0.56983781]
[0.59205773]
[0.60057486]]
2 [[0.72602796]
[0.94067579]
[0.83591236]
...
[0.77916283]
[0.90032058]
[0.78291184]]
2 210.645081151866
2 [[0.63353102]
[0.64265939]
[0.65118627]]
3 [[0.74507968]
[0.95318096]
[0.85588864]
...
[0.79953834]
[0.91705918]
[0.80329027]]
3 186.2933734713245
3 [[0.6846678 ]
[0.68164316]
[0.69020355]]
4 [[0.75952936]
[0.96114086]
[0.87010085]
...
[0.81456476]
[0.92830628]
[0.81829009]]
4 169.32091332021724
4 [[0.72771826]
[0.71342293]
[0.72202744]]
5 [[0.77112943]
[0.96669774]
[0.88093323]
...
[0.82635507]
[0.93649788]
[0.83004119]]
5 156.53923256347372
Please help me to solve this problem

I see you have set the learning rate too small. Set it to 0.001 and increase the number of epochs to 20k, and you will see your model learning well.
Plotting error vs. epochs should give you a better idea of where to stop.
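For comparison, here is a minimal sketch of the same single-layer model trained with the gradient of the mean binary cross-entropy, which for a sigmoid output reduces to `X.T @ (y_pred - y) / N`. The data layout mirrors the question; the learning rate, epoch count, and 0.5 threshold are assumptions, and accuracy is measured on the training data only to keep the example short. Two details in the posted code may also contribute to the symptom and seem worth checking: the weight update multiplies by the scalar loss rather than by the per-sample error `y_pred - y`, and `math.ceil` maps every sigmoid output in (0, 1] to 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
# Same layout as the question: class 1 in [0, 1)^2, class 0 shifted by 1.1.
X = np.vstack([rng.random((n, 2)), rng.random((n, 2)) + 1.1])
y = np.concatenate([np.ones(n), np.zeros(n)]).reshape(-1, 1)
Xb = np.hstack([np.ones((2 * n, 1)), X])  # prepend a bias column

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

W = np.zeros((3, 1))
lr, epochs = 0.5, 3000  # assumed values, not tuned
for _ in range(epochs):
    y_pred = sigmoid(Xb @ W)
    grad = Xb.T @ (y_pred - y) / len(y)  # gradient of mean cross-entropy
    W -= lr * grad                       # step against the gradient

# Threshold at 0.5 instead of rounding up with ceil.
pred_labels = (sigmoid(Xb @ W) >= 0.5).astype(int)
accuracy = float((pred_labels == y).mean())
print(f"training accuracy: {accuracy:.3f}")
```

Because the gradient is averaged over the samples, a much larger learning rate than 0.000001 stays stable here; with a summed gradient, as in the question, the usable learning rate would be correspondingly smaller.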
