How can I use RNN for classification of variable length sequence? - python

I have a dataset in which each sample is a sentence of variable length. I know this seems like a very basic question that one should be able to solve just by googling, but I am really new to this and have already searched as much as I could.
I came up with some solutions which don't involve padding or truncation.
First, I converted each sample (a string) to a sequence of vectors. I got "Value error: Unable to convert to tensor" errors. I also tried the convert_to_tensor method but got the same error. Each sample was a list of numpy arrays, where the length of the list varied from sample to sample (based on the number of tokens in the sentence). The model I tried for this is below:
inputs = keras.Input(shape=(None, emb_size))
gru_out = keras.layers.Bidirectional(keras.layers.GRU(gru_size, return_sequences=False))(inputs)
gru_out = keras.layers.Flatten()(gru_out)
predictions = keras.layers.Dense(num_classes, activation='sigmoid')(gru_out)
model = keras.Model(inputs=inputs, outputs=predictions)
model.compile(optimizer=keras.optimizers.Adam(), loss='binary_crossentropy', metrics=['accuracy'])
Second, I converted the tokenized words to integer tokens and fed them to a model whose embedding layer already holds the pretrained embedding weights.
model = tf.keras.Sequential()
model.add(tf.keras.layers.Embedding(input_dim=len(vocab), output_dim=emb_size, weights=[embedding_matrix], trainable=False))
model.add(tf.keras.layers.GRU(32, return_sequences=True))
model.add(tf.keras.layers.GRU(32, return_sequences=True))
model.add(tf.keras.layers.GRU(32))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(1, activation='sigmoid'))
Here the embedding matrix was a list of tensorflow tensors. The input was a pandas Series of samples, where each sample was a numpy array of integers (tokens). The labels in both cases were a Series of integers. This method gave the error "ValueError: Data cardinality is ambiguous: Make sure all arrays contain the same number of samples." I was confused by this error because I have read that GRU cells can be used to train models with variable-size input.
There are other methods like padding sequences to some length before feeding to model. I was able to make it work but I don't want to use padding or truncation.
Another method that worked was a model with a text vectorization layer and an embedding layer, fed directly with the sentences. But here I felt a logical gap: I need to adapt the vectorizer, and if a new word arrives at prediction time, the integers assigned to every word will change, making everything useless. The embedding vector for an out-of-vocabulary word is not a problem, since I am using a pretrained model to obtain the vectors.
I don't know why such a simple task should be so tricky. I must be doing something very stupid; kindly guide me out.
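One padding-free option can be sketched as follows. It is an assumption, not something taken from the post: recurrent layers accept any number of timesteps, but each individual batch still has to be a rectangular array, which is probably why a Series of differently sized arrays is rejected. Grouping samples by length yields rectangular batches. The helper name `batches_by_length` and the toy data are hypothetical.

```python
from collections import defaultdict

import numpy as np


def batches_by_length(sequences, labels, batch_size=32):
    """Yield (x, y) batches in which every sequence has the same length."""
    groups = defaultdict(list)
    for idx, seq in enumerate(sequences):
        groups[len(seq)].append(idx)
    for length in sorted(groups):
        indices = groups[length]
        for start in range(0, len(indices), batch_size):
            chunk = indices[start:start + batch_size]
            x = np.stack([np.asarray(sequences[i]) for i in chunk])
            y = np.asarray([labels[i] for i in chunk])
            yield x, y


# Toy token-id sequences of lengths 3, 2, 3 and 2 (hypothetical data).
sequences = [[4, 8, 2], [5, 1], [7, 7, 3], [9, 6]]
labels = [1, 0, 0, 1]
batches = list(batches_by_length(sequences, labels, batch_size=2))
batch_shapes = [x.shape for x, _ in batches]
```

Each `(x, y)` pair could then be passed to `model.train_on_batch(x, y)`, provided the model's input shape leaves the time dimension as `None`.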

Related

How to build an end-to-end NLP RNN classification model?

I have trained a NLP model with RNN on keras to classify tweets with word embeddings (Stanford GloVe) used as a feature selection method. I would like to apply this model trained onto new tweets extracted. However, this error appears when I try to apply the model to new data.
ValueError: Error when checking input: expected input_1 to have shape (22,) but got array with shape (51,)
Then I realised that the model trained is expecting an input with a 22-input vector (the max tweet length in the training set tweets). On the other hand, the new dataset I would like to apply the model to has a 51-input vector (the max tweet length in the new dataset tweets).
In attempt to tackle this, I increased the size of the vector when training the model to 51 so both would be balanced. A new error pops up:
InvalidArgumentError: indices[29,45] = 5870 is not in [0, 2489)
Thus, I decided to try to apply the model back on the training dataset to see if it was possible in the first place with the original parameters and model. It was not and a very similar error appeared.
InvalidArgumentError: indices[23,11] = 2489 is not in [0, 2489)
In this case, how can I export an end-to-end NLP RNN classification model to apply to new, unseen data? (FYI: I was able to do this successfully for Logistic Regression with TF-IDF used as feature selection. There just seem to be a lot of issues with the word embeddings.)
===========
UPDATE:
I was able to solve this issue by pickling not only the model, but also variables such as the max_len, texttotensor_instance and tokenizer. When applying the model to new data, I will have to use the same variables generated from the training data (instead of redefining them with the new data).
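The idea in this update can be sketched in plain Python. The functions `build_vocab` and `encode` are hypothetical stand-ins for the Keras tokenizer and the `texttotensor_instance` mentioned above, and the reserved ids 0 (padding) and 1 (unknown word) are assumptions. The point is that the vocabulary and `max_len` are computed once on the training data, pickled, and reused unchanged on new data.

```python
import pickle


def build_vocab(texts):
    """Map each training word to an integer; 0 = padding, 1 = unknown word."""
    vocab = {"<pad>": 0, "<unk>": 1}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab, max_len):
    """Turn a text into exactly max_len ids using the saved training vocabulary."""
    ids = [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]
    ids = ids[:max_len]                                    # truncate long inputs
    return ids + [vocab["<pad>"]] * (max_len - len(ids))   # pad short inputs


train_texts = ["good morning world", "bad day"]
vocab = build_vocab(train_texts)
max_len = max(len(t.split()) for t in train_texts)

# Persist the preprocessing state next to the model (in-memory bytes here).
blob = pickle.dumps({"vocab": vocab, "max_len": max_len})

# At prediction time: load the saved state instead of refitting on the new data.
state = pickle.loads(blob)
encoded = encode("good evening world again today", state["vocab"], state["max_len"])
```

Because unseen words fall back to the unknown id and long inputs are cut to `max_len`, every encoded tweet has the shape and index range the trained model expects.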
Your error occurs because your data contains word indices that are larger than the Embedding layer allows (its input_dim).
It seems that the input_dim parameter of your Embedding layer is set to 2489, while words in your dataset are tokenized and mapped to higher values (such as 5870).
Also, don't forget to add one to the maximum number of words when you set this in the Embedding layer (input_dim=max_number_of_words+1). If you want to know why, see this question: Keras embedding layer masking. Why does input_dim need to be |vocabulary| + 2?
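A rough NumPy illustration of why the extra row is needed, using a plain array as a stand-in for the Embedding layer's weight table. It assumes word indices start at 1 and index 0 is reserved (for padding, for example); the sizes are made up.

```python
import numpy as np

max_number_of_words = 5   # largest word index produced by the tokenizer (1..5)
emb_size = 4

# Index 0 is reserved, so the table needs max index + 1 rows.
embedding_table = np.random.rand(max_number_of_words + 1, emb_size)

token_ids = np.array([[1, 5, 0], [3, 2, 5]])
assert token_ids.max() < embedding_table.shape[0], "token id outside input_dim"

vectors = embedding_table[token_ids]   # shape: (batch, timesteps, emb_size)

# A table with only max_number_of_words rows cannot serve index 5.
too_small = embedding_table[:max_number_of_words]
try:
    too_small[token_ids]
    lookup_failed = False
except IndexError:
    lookup_failed = True
```

The failed lookup corresponds to the "is not in [0, 2489)" error in the question: the largest index must be strictly smaller than the number of rows.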

Neural Network with several inputs (keras, text classification)

I am new to machine learning and experimented a bit with neural networks and did some research.
I am currently trying to make a mini network for fake news detection.
My data has several features (statement,speaker,date,topic..), so far I've tried using simply the text of the false and true statements as input for my network and used glove for word embeddings. I tried out the following network:
model = tf.keras.Sequential(
    [
        # part 1: word and sequence processing
        tf.keras.layers.Embedding(embeddings_matrix.shape[0], embeddings_matrix.shape[1], weights=[embeddings_matrix], trainable=True),
        tf.keras.layers.Conv1D(128, 5, activation='relu'),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dropout(0.2),
        # part 2: classification
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
It gives me over 90% training accuracy and around 65% testing accuracy.
So now I want to try adding 2 more features: the speaker, which is given as [firstname,lastname], and the topic, which can be one or several words (e.g. vaccines or politics-elections-2016) and which I decided to limit/pad to 5 words.
Now I don't know how to combine those features into one model. I don't think using word embeddings on the other features makes sense. Do I have to make 3 different networks and concatenate them into one? Can I use both the speaker and the topic as inputs for the same network? If I concatenate them, how would I do so (using the already classified output of each as input)?
I have in the past concatenated different input types like this. You can’t use Sequential anymore, the docs say
A Sequential model is not appropriate when:
• Your model has multiple inputs or multiple outputs
• Any of your layers has multiple inputs or multiple outputs
In the Keras functional API docs, there is a section called “Models with multiple inputs and outputs” that covers almost exactly the problem you are asking about.
However, I wrote an answer to say: don’t rule out turning your structured data into text, such as prepending “xxspeaker firstname lastname xxtopic topicname” to the text you currently have. It might not work for your current model, which seems pretty small... but if you were to be using a larger model, or fine-tuning a large LM for the task, like if you were using fast.ai or huggingface, you almost certainly have the capacity to just learn it from text.
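A minimal sketch of that prepending step. The function name `add_metadata_tokens` is hypothetical, the speaker is made up, and replacing the hyphens in the topic with spaces is an assumption about how the topic words should be separated.

```python
def add_metadata_tokens(statement, speaker, topic):
    """Prepend speaker and topic as marked-up text in front of the statement."""
    first_name, last_name = speaker
    topic_words = topic.replace("-", " ")   # assumption: hyphens separate topic words
    return f"xxspeaker {first_name} {last_name} xxtopic {topic_words} {statement}"


example = add_metadata_tokens(
    statement="Vaccines were approved last year.",
    speaker=["Jane", "Doe"],                 # hypothetical speaker
    topic="politics-elections-2016",
)
```

The resulting string goes through the same tokenizer as before, so the model keeps a single text input.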

Most suitable Machine Learning algorithm for this problem?

I have a dataset and I want to decide which ML algorithm to apply to my problem.
Customers fill out an assessment questionnaire of 50 questions. Example questions are: what is your job, what is your previous job history, how much do you earn, have you been rejected for a loan, etc. The end goal is to decide whether they should be rejected or not.
I have circa 500 entries for my algorithm to learn from. I have pre-processed my dataset and converted the inputs into a numpy array, and I am wondering what would be the best algorithm to use. Should I use a classification algorithm or a neural network in tensorflow, and if the latter, which layers should I use?
Thanks
How about beginning with xgboost or random forest? - So plain "old" ML?
The advantage would be that you could visualize the decision tree of the model once trained.
If using a NN in tensorflow (or even easier: keras with tensorflow backend), you could go with an MLP (multilayer perceptron), since each question's answer has a fixed position in the input. You don't need many layers.
It is important to normalize your input data column-wise, so that the input values are not much bigger than +1 or much smaller than -1. Introductory books often miss this point, although it is important.
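Column-wise normalization could look roughly like this in NumPy. The function name `standardize_columns` and the two example columns are hypothetical; the sketch uses standardization (zero mean, unit variance), which is one common choice among several.

```python
import numpy as np


def standardize_columns(x_train, x_new):
    """Scale every column to zero mean and unit variance using training statistics."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    std[std == 0] = 1.0          # avoid division by zero for constant columns
    return (x_train - mean) / std, (x_new - mean) / std


# Hypothetical answers: yearly income and number of previous jobs.
x_train = np.array([[30000.0, 1.0], [60000.0, 3.0], [90000.0, 5.0]])
x_new = np.array([[60000.0, 3.0]])
x_train_scaled, x_new_scaled = standardize_columns(x_train, x_new)
```

The mean and standard deviation come from the training rows only and are reused for new rows, so both are scaled the same way.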
Since your target label is "accept" or "reject", a binary classifier will do (this also holds for the classical machine learning approach). You use 0 and 1 as labels.
For this kind of classification, a NN doesn't need many layers or neurons. Try the smallest network first: let's say 10 neurons in the first layer, then 7 neurons in the next layer (probably even fewer), and then 1 output neuron for the binary decision.
With Keras this would be:
from keras.models import Sequential
from keras.layers import Dense

def create_mlp(n_input=50):  # number of input columns; one per question gives 50 here
    model = Sequential()
    model.add(Dense(10, input_dim=n_input, kernel_initializer='normal', activation='relu'))  # kernel_initializer was called init in older Keras versions
    model.add(Dense(7, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(optimizer='adam',
                  loss='binary_crossentropy',
                  metrics=['acc'])
    return model

model = create_mlp(50)  # builds and compiles the network
(Your data frame or NumPy input array must have the samples as rows and one column per question's answer. You have to encode the answers in numeric form. The numbers should be small, ideally between -1 and 1. NNs don't like big numbers, so column-wise normalization can help.)
That's it. I learned all this stuff last year. Good luck for learning. It will be tons of fun!

Adding LSTM layers in Keras produces input error

First of all apologies if I word this wrong, I'm relatively new to TensorFlow
I am designing a model for simple classification of a dataset, each column is an attribute and the final column is the class. I split these and generate a dataframe in the usual way.
If I generated a model with dense layers, it works great:
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(30, input_dim=len(dataframe.columns)-2, activation='sigmoid'))
    model.add(Dense(20,activation='sigmoid'))
    model.add(Dense(unique, activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model
If I were to add, say, an LSTM layer to the model:
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(30, input_dim=len(dataframe.columns)-2, activation='sigmoid'))
    #this bit here >
    model.add(LSTM(20, return_sequences=True))
    model.add(Dense(20,activation='sigmoid'))
    model.add(Dense(unique, activation='softmax'))
    # Compile model
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    model.summary()
    return model
I get the following error when I execute the code:
ValueError: Input 0 is incompatible with layer lstm_1: expected ndim=3, found ndim=2
I'm not sure where these variables are coming from although, maybe it's the classes? I have three classes of data ('POSITIVE', 'NEGATIVE', 'NEUTRAL') mapped to sets of well over 2,000 attribute values - they're statistical extractions of timewindowed EEG brainwave data from multiple electrodes classed by emotional state.
Note:
The 'input_dim=len(dataframe.columns)-2' produces the number of attributes (inputs), I do this as I'd like the script to work with CSV datasets of different sizes on the fly
Also, there are no tabs in my code pasted but it is indented and will compile
The full code is pasted here: https://pastebin.com/1aXp9uDA for presentation purposes. Apologies in advance for the terrible practices! This is just an initial project, I do plan on cleaning it all up later on!
In your original code, your input has 2 dimensions and is shaped (batch, feature). When you add an LSTM, you're telling Keras you want to do the classification given the last N timesteps, hence you need an input that is shaped (batch, timestep, feature). It's easy to think that an LSTM will look back across all inputs in the batch, but unfortunately it doesn't. You have to manually organize your data to present all the timestep elements together.
To split up your data you generally do a sliding window of length N (where N is how many values you wish the LSTM to look back across). You can slide the window N steps each time, meaning there's no overlap of the data or you can simply slide it one sample, meaning you get multiple copies of your data. There are numerous blogs on how to do this. Take a look at this one How to Reshape Input Data for LSTM.
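The windowing described above can be sketched in NumPy as follows. The function name `sliding_windows` and the toy array are hypothetical; `step` controls whether the windows overlap.

```python
import numpy as np


def sliding_windows(data, window, step):
    """Cut a (samples, features) array into (batch, timestep, feature) windows."""
    starts = range(0, len(data) - window + 1, step)
    return np.stack([data[s:s + window] for s in starts])


data = np.arange(12, dtype=float).reshape(6, 2)   # 6 samples, 2 features

no_overlap = sliding_windows(data, window=3, step=3)   # slide N steps each time
overlapping = sliding_windows(data, window=3, step=1)  # slide one sample each time
```

Both results are 3-dimensional, which is the `ndim=3` input the LSTM layer asks for in the error message.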
You also have one other issue. For your LSTM, you probably want "return_sequences=False". With this equal to True, you would need to have an output "Y" value for each element of your timestep. You probably want your "Y" value to represent the next value in your time-series. Keep this in mind when organizing your data.
The above link provides some nice examples or you can search for more in-depth ones. If you follow those it should be clear how to reorganize things for an LSTM.

Keras Dense Net Overfitting

I am attempting to use keras to build an activity classifier from accelerometer signals. However, I am experiencing extreme overfitting of the data even with the most simplistic of models.
The input data is of shape (10,3) and contains roughly .1 second of data from the accelerometer in 3 dimensions. The model is simply
model = Sequential()
model.add(Flatten(input_shape=(10,3)))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
The model should output the label [1,0] for walking activities and [0,1] for non-walking activities. After training I get 99.8% accuracy (if only it was real...). When I attempt to predict on data that wasn't used for training, I get 50% accuracy, verifying that the net isn't really "learning" anything except to predict a single class value.
The data is prepared from 100 Hz triaxial accelerometer signals. I am not preprocessing the data in any way except for windowing it into bins of length 10 that overlap with the previous bin by 50%. What measures can I take to make the network produce actual predictions? I have tried increasing the window size but the results remain the same. Any advice/general tips are greatly appreciated.
Ian
Try adding some hidden layers and dropout layers to your network. You could create a simple Multi Layer Perceptron (MLP) with a couple of extra lines in between your Flatten layer and Dense layer:
model.add(Dense(64, activation='relu', input_dim=30))
model.add(Dropout(0.25))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.1))
Or check out this guide, which explains how to create a simple MLP.
Without any hidden layers, your model can only learn a direct linear mapping from the input features to the output classes, so it cannot build intermediate features from the input data.
The more layers you add, the more intermediate features and patterns it should extract from the input data which should lead to better model predictions for test data. There will be a lot of trial and error to design the best model as too many layers can result in over fitting.
You have not provided information about how you train the model, so that may be the cause of the issue as well. You must ensure that the data is split into training, validation and test sets. Some possible split ratios for training, validation and test data are 60%:20%:20% or 70%:15%:15%. This is ultimately something that you must decide yourself.
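A 60%:20%:20% split could be sketched like this in NumPy. The function name `split_indices` and the fixed seed are hypothetical choices.

```python
import numpy as np


def split_indices(n_samples, train=0.6, val=0.2, seed=0):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_train = round(n_samples * train)
    n_val = round(n_samples * val)
    return order[:n_train], order[n_train:n_train + n_val], order[n_train + n_val:]


train_idx, val_idx, test_idx = split_indices(100)
```

With windows that overlap by 50%, as in the question, neighbouring windows share samples. Splitting by recording rather than by individual window may therefore give a more honest test score.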
The problem of overfitting was caused by the input data type. The values passed to the classifier should have been float values with 2 decimal places. Somewhere along the way, some of these values had been augmented and had significantly more than 2 decimal places. That is, the input should have looked like
[9.81, 10.22, 11.3]
but instead looked like
[9.81000000012, 10.220010431, 11.3000000101]
The classifier was making its prediction based on this feature, which is obviously not the desired behavior! Lesson learned: make sure the data preparation is consistent! Thanks to @umutto for the suggestion of random forests; the simple structure was helpful for diagnosing the problem.
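A check along these lines might have caught the inconsistency earlier. This is only a sketch: the function name `has_extra_decimals` and the tolerance are assumptions, and rounding is just one possible way to make the two data sources consistent.

```python
import numpy as np


def has_extra_decimals(values, decimals=2):
    """Flag entries that change when rounded to the given number of decimals."""
    values = np.asarray(values, dtype=float)
    return ~np.isclose(values, np.round(values, decimals), rtol=0.0, atol=1e-12)


clean = [9.81, 10.22, 11.3]
leaky = [9.81000000012, 10.220010431, 11.3000000101]

clean_flags = has_extra_decimals(clean)   # no entry is flagged
leaky_flags = has_extra_decimals(leaky)   # every entry is flagged
fixed = np.round(leaky, 2)                # brings both sources to the same precision
```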
