I'm trying to do text classification with scikit-learn.
I have text that does not classify well. I think I can improve the predictions by adding data I can deduce in the form of an array of integers.
For example, sample 1 would come with [3, 1, 5, 2] and sample 2 would come with [2, 1, 4, 2]. This would also be true of the test data.
The idea is that the classifier could use both the text and the numbers to classify the data.
I've read the scikit-learn documentation and I can't find how to do it. It must be possible, because all that is classified internally is vectors of numbers, so adding another vector of numbers should not be much of a problem, but I can't figure out how. partial_fit adds more samples; it does not add more information about the existing samples. Is there a way to do what I'm trying to do? I tried to combine GaussianNB with SGDClassifier, but it turns out I don't know how to do that. (Was it a bad idea?)
What should I do?
I think you could add this new feature as another dimension to your training data. You need to modify the training data by adding your new features before calling SGD.
A simple/naive way would be:
For example, if my training data with two samples were
X = [ [1,2,3], [8,9,0] ]
And my new features for each sample was
new_feature_X = [ [11,22,33] , [77,88,00] ]
My new training data would be:
X_new = [[1,2,3,11,22,33] , [8,9,0,77,88,00]]
Then you call SGD.fit(X_new, labels)
As far as my SGD knowledge goes, I don't think there is any other way to combine two features.
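A minimal sketch of that concatenation in scikit-learn. It assumes the text is vectorized with TfidfVectorizer (the question does not say which vectorizer is used), and the toy documents, labels and numeric arrays are invented for illustration. scipy.sparse.hstack appends the extra columns without turning the sparse text matrix into a dense one.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# Toy data, invented for illustration only.
texts = ["cheap pills buy now", "meeting moved to friday",
         "buy cheap watches now", "friday lunch with the team"]
extra = np.array([[3, 1, 5, 2], [2, 1, 4, 2], [3, 1, 5, 1], [2, 2, 4, 2]])
labels = np.array([1, 0, 1, 0])

vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(texts)  # sparse, shape (4, n_terms)

# Append the hand-crafted numeric columns to the text columns.
X_new = hstack([X_text, csr_matrix(extra)]).tocsr()

clf = SGDClassifier(random_state=0)
clf.fit(X_new, labels)

# Test data must be built the same way: transform (not fit) the text,
# then append its numeric columns in the same order.
test_texts = ["buy pills now"]
test_extra = np.array([[3, 1, 5, 2]])
X_test = hstack([vectorizer.transform(test_texts), csr_matrix(test_extra)]).tocsr()
pred = clf.predict(X_test)
```

SGD-based models are sensitive to feature scale, so it may be worth scaling the numeric columns before stacking them next to TF-IDF values, which lie between 0 and 1.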
The idea is that the classifier could use both the text and the
numbers to classify the data.
I find a neural network to be much more suitable for this. You could use two input layers, one for text vectors, and one for the numbers and feed them together into a network to get the output.
I tried to combine GaussianNB with SGDClassifier, but it turns out I
don't know how to do that. (Was it a bad idea?)
SGD stands for stochastic gradient descent. Is it possible to find the gradient of naive Bayes? What is the corresponding cost function?
What should I do?
Ensemble. Train two separate classifiers: one on your text data and another on your new handcrafted features, then average their predicted probabilities. You could also train multiple classifiers and take their votes. This tutorial is great for that.
Try out MLP Classifier. I used it a while ago, and found it works pretty great with text.
Neural networks. It's pretty easy with Keras.
Read the research literature. There is a pretty good chance academia has already done some work on your dataset, so try to read some of it. Google Scholar and Semantic Scholar are great places to find published research.
from keras.layers import Input, Dense,Concatenate
from keras.models import Model
# This returns a tensor
text_input_vec = Input(shape=(784,))
new_numeric_feature = Input(shape=(4,))
# feed your text to a dense layer
dense1 = Dense(64, activation='relu')(text_input_vec)
# feed your numeric feature to another dense layer
dense2 = Dense(64, activation='relu')(new_numeric_feature)
# concatenate/combine the output of both
concat = Concatenate(axis=-1)([dense1,dense2])
# use the above to predict the label of your text. Layer below
# assumes you have 2 classes
predictions = Dense(2, activation='softmax')(concat)
model = Model(inputs=[text_input_vec,new_numeric_feature], outputs=predictions)
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.summary()
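The ensemble suggestion (one classifier per feature type, then averaging the predicted probabilities) could be sketched roughly like this. The choice of LogisticRegression for the text and GaussianNB for the numbers is an assumption made for illustration, and the toy data is invented.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Toy data, invented for illustration only.
texts = ["cheap pills buy now", "meeting moved to friday",
         "buy cheap watches now", "friday lunch with the team"]
extra = np.array([[3, 1, 5, 2], [2, 1, 4, 2], [3, 1, 5, 1], [2, 2, 4, 2]])
labels = np.array([1, 0, 1, 0])

vectorizer = TfidfVectorizer()
X_text = vectorizer.fit_transform(texts)

# One classifier per feature type.
text_clf = LogisticRegression().fit(X_text, labels)
num_clf = GaussianNB().fit(extra, labels)

# Both models sort classes_ the same way, so the probability columns
# line up and can be averaged directly (soft voting).
proba = (text_clf.predict_proba(X_text) + num_clf.predict_proba(extra)) / 2.0
pred = text_clf.classes_[np.argmax(proba, axis=1)]
```

In practice the averaged probabilities should be evaluated on held-out data, not on the training samples as in this toy example.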
I am trying to train the following RNN in tensorflow. It takes an 11-D numeric vector as input and it outputs a sequence of 10 multiclass probability vectors, with 14 exclusive classes.
model = keras.models.Sequential([
    keras.layers.SimpleRNN(30, return_sequences=False, input_shape=[1, 11]),
    keras.layers.RepeatVector(10),
    keras.layers.SimpleRNN(30, return_sequences=True),
    keras.layers.SimpleRNN(14, return_sequences=True, activation="softmax")
])
model.compile(loss="categorical_crossentropy",
              optimizer="adam")
history = model.fit(X_train, y_train, epochs=50, batch_size=32,
                    validation_split=0.2)
However, even for a small dataset of 10 points, it takes hundreds of epochs to fit. As you can see in the figure, the loss barely goes down with the training epochs:
When I try to train the real training set, the loss simply does not move. Any idea of how to successfully train this model?
You can find the first 10 datapoints here
And the first 100 datapoints here
To load the data just use:
import pickle
with open('train10.pickle', 'rb') as f:
    X_train, y_train = pickle.load(f)
Thank you very much for your help
EDIT:
To provide additional context, what I have in this problem is a continuous numeric embedding in 11-D to start with, and the output is a sequence of one-hot encodings, so you can think of this problem as training a decoder or doing a decompression to get a sort of "words" back from points in the numeric space (each one-hot vector in the output could be thought of as a "letter"). I previously tried to train a non-recurrent network outputting the full list of one-hot encodings (whole "word") at once, but the performance was also very poor. I just do not see what the bottleneck is: whether it is the dimensionality of the numeric embedding, the training algorithm, or something else. My tinkering so far with types of layers, numbers of layers, or learning rates did not produce substantial improvements. I am open to sharing the whole dataset if you think that can help. Thank you very much!
Each machine learning problem is unique and it is very difficult to say exactly what the issue is without having access to the full data set. Some possibilities are:
The model specification is suboptimal - try varying the number of hidden layers, the number of neurons in each layer, using GRU/LSTM layers instead of SimpleRNN, adding some dropout layers, etc.
The training algorithm needs to be adjusted - try using a different optimizer, a different batch size, a different train-test split ratio etc.
The input data needs more (or less) preprocessing - try normalizing/standardizing the input features if you haven't already.
You need to do more work on feature engineering - think deeply about all potential relationships between the input data and the target, and try combining columns to create ratios etc. While the NN can theoretically figure this out for itself, it is often effective to try and reduce the work it has to do in this respect.
Your problem may just be difficult or even unsolvable. There may just be no strong relationship between the input and the target.
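As one concrete instance of the preprocessing point, here is a minimal sketch of standardizing the inputs with NumPy. The arrays are random stand-ins with the (samples, 1, 11) shape implied by input_shape=[1, 11] in the question; the real data is not used here. The statistics are computed on the training set only and then reused for validation data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-ins for the real inputs (assumption): 1 time step, 11 features.
X_train = rng.normal(loc=5.0, scale=3.0, size=(100, 1, 11))
X_val = rng.normal(loc=5.0, scale=3.0, size=(20, 1, 11))

# Per-feature statistics, computed on the training set only.
mean = X_train.mean(axis=(0, 1), keepdims=True)
std = X_train.std(axis=(0, 1), keepdims=True) + 1e-8  # avoid division by zero

X_train_std = (X_train - mean) / std
X_val_std = (X_val - mean) / std  # reuse the training statistics
```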
I am new to machine learning and experimented a bit with neural networks and did some research.
I am currently trying to make a mini network for fake news detection.
My data has several features (statement,speaker,date,topic..), so far I've tried using simply the text of the false and true statements as input for my network and used glove for word embeddings. I tried out the following network:
model = tf.keras.Sequential(
    [
        # part 1: word and sequence processing
        tf.keras.layers.Embedding(embeddings_matrix.shape[0], embeddings_matrix.shape[1], weights=[embeddings_matrix], trainable=True),
        tf.keras.layers.Conv1D(128, 5, activation='relu'),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dropout(0.2),
        # part 2: classification
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
It gives me over 90% training accuracy and around 65% testing accuracy.
So now I want to try adding 2 more features: the speaker, which is given as [firstname,lastname], and the topic, which can be just one or several words (e.g. vaccines or politics-elections-2016), which I decided to limit/pad to 5 words.
Now I don't know how to combine those features into one model. I don't think using word embeddings on the other features makes sense. Do I have to make 3 different networks and concatenate them into one? Can I use both the speaker and topic as inputs for the same network? If I concatenate them, how would I do so (using the already classified output for each as input)?
I have in the past concatenated different input types like this. You can’t use Sequential anymore, the docs say
A Sequential model is not appropriate when:
• Your model has multiple inputs or multiple outputs
• Any of your layers has multiple inputs or multiple outputs
In the Keras functional API docs, there is a section called “Models with multiple inputs and outputs” that covers almost exactly the problem you are asking about.
However, I wrote an answer to say: don’t rule out turning your structured data into text, such as prepending “xxspeaker firstname lastname xxtopic topicname” to the text you currently have. It might not work for your current model, which seems pretty small... but if you were to be using a larger model, or fine-tuning a large LM for the task, like if you were using fast.ai or huggingface, you almost certainly have the capacity to just learn it from text.
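A small sketch of that text-prepending idea, using only the standard library. The function name add_metadata_tokens and the example statement are hypothetical; the marker tokens xxspeaker and xxtopic are the ones suggested above.

```python
def add_metadata_tokens(statement, speaker, topic):
    """Prefix a statement with marker tokens for its speaker and topic."""
    speaker_text = " ".join(speaker)  # ["jane", "doe"] -> "jane doe"
    # "politics-elections-2016" -> "politics elections 2016"
    topic_text = topic.replace("-", " ")
    return f"xxspeaker {speaker_text} xxtopic {topic_text} {statement}"


# Hypothetical example row.
example = add_metadata_tokens(
    "The budget doubled last year.",
    ["jane", "doe"],
    "politics-elections-2016",
)
```

The resulting string can go through the same tokenizer as before, so the model architecture does not need to change.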
I saw code like this
self.feature = model_func()
if loss_type == 'softmax':
    self.classifier = nn.Linear(self.feature.final_feat_dim, num_class)
    self.classifier.bias.data.fill_(0)
elif loss_type == 'dist':  # Baseline++
    self.classifier = backbone.distLinear(self.feature.final_feat_dim, num_class)
where model_func is a ConvNet 4/6 or ResNet 10/18/34/101
What here is classifier?
I know that in neural networks we have parameters that we learn, buffers that are used to store something that gets updated during training, activations that are the results after each layer.
Is feature same as activation, and what is a classifier, where is the end of features, and the beginning of classifier in a neural network? Is the result of a classifier also activation?
I find the question a little messy, but I'll give my best from what I understand you're asking.
What here is classifier?
The classifier would be the model itself. The model is the one who will, after being trained, be able to classify new data.
Is feature same as activation
I don't know what kind of feature you have in mind. In data science context, a feature is understood to be one of the variables of the data one has. For example, if you have a dataset about houses, you may have features such as latitude, long., if it has a pool, how many bedrooms it has, etc.
Activation functions are mathematical equations that determine the output of a neural network. The function is attached to each neuron in the network, and determines whether it should be activated (“fired”) or not, based on whether each neuron’s input is relevant for the model’s prediction. [1]
I'm not sure I'm truly understanding what you're asking.
Is the result of a classifier also activation?
The result of a classifier is the label, the class to which each data point belongs. Activation functions are used by neural networks in the process of classifying.
Hope this helps!
[1] https://missinglink.ai/guides/neural-network-concepts/7-types-neural-network-activation-functions-right/
So I am dealing with a simple neural network with 10 inputs and one output. I can have as many hidden layers as suggested, however I am using 2. I am also using "mean_squared_error" loss function and RMSProp optimizer.
Anyhow, the question I have is, lets suppose my output values are like this:
[0,0,3,0,0,0,5,0,0,2,0...] etc. Note that the value 0 occurs far more often than the others. What I would like to do is force the neural network to learn the cases with non-zero outputs better, i.e. to give more "importance" to those values.
If I use 'mean_squared_error', training optimizes over the entire dataset, so it mostly optimizes the cases where the output value is 0.
EDIT:
The problem I am dealing with could be a simple model of a physical system. Let us say we have a black-box system with known inputs. This black box has a single output (let us say temperature). Based on our inputs and the corresponding outputs, we could model the system using a neural network as a "black box" and then use the trained NN to predict temperature.
EDIT:
So I am now using different training/validation set. I was suspecting that there is something wrong with the previous one.
Now I got something like the image above (please see the immediate spike)
What could cause that?
Keep in mind, I am not experienced in NNs, so literally any feedback is welcome :)
There are two important concepts in ML: "underfitting" and "overfitting". In your case I think it's underfitting.
There are some ways to overcome this problem:
Make your model more complex by adding more layers and units.
If you are using regularization terms, decrease their values.
Use more features (if there are any).
Hope this helps.
If your outputs are integers [0,0,3,0,0,0,5,0,0,2,0...], i.e., classes, you are probably doing classification, so your loss should be categorical_crossentropy. In this case, there are two ways of doing what you want:
1- You can use SMOTE (Synthetic Minority Oversampling Technique) so that the non-zero classes get the same weight as the zero class. The snippet below uses SMOTEENN, which combines SMOTE with edited-nearest-neighbours cleaning:
from imblearn.over_sampling import SMOTE
from imblearn.combine import SMOTEENN
sm = SMOTEENN()
# fit_sample was renamed to fit_resample in newer imbalanced-learn releases
x, y = sm.fit_resample(X, Y)
2- You can also adjust Keras class weights. For binary classes:
class_weight = {0: 1.,1: 30.}
# nb_epoch is the Keras 1 name; Keras 2 and later use epochs
model.fit(X, Y, epochs=1000, batch_size=16, class_weight=class_weight)
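Instead of hand-picking values such as 30, the weights can be derived from the label counts. This is a minimal sketch using only the standard library; the helper name inverse_frequency_weights is hypothetical, and the formula n_samples / (n_classes * count) is the common "balanced" heuristic, which is an assumption here and not something the answer above prescribes.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# The example outputs from the question: 0 is far more frequent.
y = [0, 0, 3, 0, 0, 0, 5, 0, 0, 2, 0]
class_weight = inverse_frequency_weights(y)
```

Rare classes get proportionally larger weights. As far as I know, Keras expects the class_weight keys to be class indices, so labels such as 2, 3 and 5 would first need to be mapped to 0..n_classes-1.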
Currently I'm using VGG16 + Keras + Theano through transfer learning to recognize plant classes. It works just fine and gives me good accuracy. The next problem I'm trying to solve is to find a way of identifying whether an input image contains a plant at all. I don't want another classifier to do that, because it's not very efficient.
So I did some searching and found that we can get the activations from the last model layer (before the activation layer) and analyze them.
from keras import backend as K

model = util.load_model()  # VGG16 model
model.load_weights(path_to_weights)

def get_activations(m, layer, X_batch):
    # Build a backend function from the model input to the chosen layer's output
    x = [m.layers[0].input, K.learning_phase()]
    y = [m.get_layer(layer).output]
    activation_fn = K.function(x, y)
    activations = activation_fn([X_batch, 0])  # 0 = test phase
    # trying to get some features from activations
    # to understand how can we identify if an image is relevant
    for l in activations[0]:
        not_nulls = [v for v in l if v > 0]
        # shows the fraction of activated neurons
        c1 = float(len(not_nulls)) / len(l)
        n_activated = len(not_nulls)
        print('c1:{}, n_activated:{}'.format(c1, n_activated))
    return activations

get_activations(model, 'the_latest_layer_name', inputs)
From the above code I've noticed that for very irrelevant images, the number of activated neurons is higher than for images that contain plants:
For images that were used for model training, 19%-23% of neurons are activated
For images that contain unknown plant species, 20%-26%
For irrelevant images, 24%-28%
This is not really a good feature for deciding whether an image is relevant, as the percentage ranges overlap.
So, is there a good way to resolve this issue?
Thanks to Feras's idea in the comment above. After some trials, I've come up with the ultimate solution that allows solving this problem with accuracy up to 99.99%.
Steps are:
Train your model on a dataset;
Store activations (see the method above for how to get them) by predicting relevant and non-relevant images with the trained model from the previous step. You should get activations from the penultimate layer. For VGG16 it's the last of the two Dense(4096) layers, for InceptionV3 an extra penultimate Dense(1024) layer, for resnet50 an extra penultimate Dense(2048) layer.
Solve a binary problem using the stored activations. I've tried a simple flat NN and Logistic Regression. Both were accurate (the flat NN was a bit more accurate), but I've chosen Logistic Regression as it's simpler, faster and consumes less memory and CPU/GPU.
This process should be repeated each time the model is retrained, as the final CNN weights are different each time and what worked previously may not work next time.
So as a result we have another small model for solving the problem.
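The last step could look roughly like the sketch below. The activations here are synthetic random arrays standing in for the stored penultimate-layer outputs (64 units instead of 4096, to keep it small), so the resulting accuracy says nothing about real images; only the workflow is the point.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_per_class, n_units = 200, 64  # real VGG16 activations would have 4096 units

# Synthetic stand-ins (assumption) for stored post-ReLU activations.
relevant = np.maximum(rng.normal(0.0, 1.0, (n_per_class, n_units)), 0)
irrelevant = np.maximum(rng.normal(1.0, 1.0, (n_per_class, n_units)), 0)

X = np.vstack([relevant, irrelevant])
y = np.array([1] * n_per_class + [0] * n_per_class)  # 1 = contains a plant

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# Small binary model on top of the stored activations.
detector = LogisticRegression(max_iter=1000).fit(X_train, y_train)
accuracy = detector.score(X_test, y_test)
```

At prediction time the same activations are computed for a new image and passed to detector.predict before trusting the plant classifier's output.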