Dealing with noisy training labels in text classification using deep learning - python

I have a dataset that comprises sentences and corresponding multi-labels (i.e. a sentence can belong to multiple labels). Using a combination of Convolutional Neural Networks and Recurrent Neural Networks on Word2Vec embeddings, I'm able to achieve good accuracy. However, the model is too good at fitting the output, in the sense that a lot of labels are arguably wrong and thus the output is too. This means that the evaluation (even with regularization and dropout) gives a misleading impression, since I have no ground truth. Cleaning up the labels would be prohibitively expensive, so I'm left to explore "denoising" the labels somehow. I've looked at papers like "Learning from Massive Noisy Labeled Data for Image Classification", but they assume learning some sort of noise covariance matrix on the outputs, which I'm not sure how to do in Keras.
Has anyone dealt with the problem of noisy labels in a multi-label text classification setting before (ideally using Keras or similar) and has good ideas on how to learn a robust model with noisy labels?

The cleanlab Python package (pip install cleanlab), for which I am an author, was designed to solve this task: https://github.com/cleanlab/cleanlab/. It's a professional package created for finding label errors in datasets and learning with noisy labels. It works with any scikit-learn model out of the box and can be used with PyTorch, FastText, TensorFlow, etc.
(UPDATED Sep 2022) I've added resources for exactly this task, text classification with noisy labels (labels that are sometimes flipped to other classes):
Blog: https://cleanlab.ai/blog/label-errors-text-datasets/
Runnable Colab Notebook: https://docs.cleanlab.ai/stable/tutorials/text.html
Example -- Find label errors in your dataset.
from cleanlab.classification import CleanLearning
from cleanlab.filter import find_label_issues
from cleanlab.count import estimate_cv_predicted_probabilities
# OPTION 1 - 1 line of code for sklearn compatible models
issues = CleanLearning(sklearnModel, seed=SEED).find_label_issues(data, labels)
# OPTION 2 - 2 lines of code to use ANY model
# just pass in out-of-sample predicted probabilities
pred_probs = estimate_cv_predicted_probabilities(data, labels)
ordered_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by='self_confidence',
)
Details on how to compute out-of-sample predicted probabilities with any model here.
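For an sklearn-compatible model, one common way to get such out-of-sample predicted probabilities is K-fold cross-validation. A minimal sketch (using sklearn's cross_val_predict rather than the cleanlab helper above, and the same data/labels variables as in the snippet):
# Sketch: out-of-sample predicted probabilities via 5-fold cross-validation.
# Works with any sklearn-compatible classifier; LogisticRegression is just an example.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

model = LogisticRegression(max_iter=1000)
# Each row of pred_probs comes from a fold whose model never saw that row during training.
pred_probs = cross_val_predict(model, data, labels, cv=5, method="predict_proba")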
Example -- Learning with Noisy Labels
Train an ML model on noisy labels as if it had been trained on perfect labels.
# Code taken from https://github.com/cleanlab/cleanlab
from sklearn.linear_model import LogisticRegression
# Learning with noisy labels in 3 lines of code.
cl = CleanLearning(clf=LogisticRegression()) # any sklearn-compatible classifier
cl.fit(X=train_data, labels=labels)
# Estimate the predictions you would have gotten training with error-free labels.
predictions = cl.predict(test_data)
In case you are also working with image or audio classification, here are working examples for Image Classification with PyTorch and Audio Classification with SpeechBrain.
Additional documentation is available here: docs.cleanlab.ai
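Since the original question is about text classification, here is a minimal sketch of one reasonable setup (my own assumption, not code from the cleanlab docs): vectorize the sentences with TF-IDF and run CleanLearning on top of a simple classifier. It is shown for single-label, multi-class data for simplicity.
# Sketch: cleanlab on text data with a TF-IDF + logistic regression baseline.
from cleanlab.classification import CleanLearning
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = your_sentences        # placeholder: your own sentences (list of str)
labels = your_noisy_labels    # placeholder: your own (noisy) integer labels

X = TfidfVectorizer().fit_transform(texts).toarray()  # dense array keeps the sketch simple
cl = CleanLearning(clf=LogisticRegression(max_iter=1000))
issues = cl.find_label_issues(X, labels)  # which labels look suspicious
cl.fit(X, labels)                         # train more robustly despite the noise
predictions = cl.predict(X)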

Related

How to train a machine learning model in Python with several target variables

I am trying to build a machine learning model in Python. I used PyTorch and sklearn to make the model. My model is a bit complicated: I have one input feature but several target variables. My target variables are the values making up a curve, and I used each value of the curve as a different target. I showed five different curves in the uploaded figure.
I used algorithms like DecisionTreeRegressor and RandomForestRegressor to fit the single input variable to the several target variables, but the trained model's predictions are not good for extrapolation. The trained model can produce a series of data, but it is not very accurate. Does anyone know of a suitable model in Python? I tried hyperparameter tuning using GridSearchCV but it did not help.
Thanks in advance for your help and feedback.
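For reference, scikit-learn's tree ensembles already handle several target variables if you pass a 2-D y, which matches the setup described above. A minimal sketch with made-up curve data (the real data and model settings will differ):
# Sketch: multi-output regression with one input feature and several targets (one per curve value).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X = np.linspace(0, 10, 200).reshape(-1, 1)                      # single input feature
Y = np.column_stack([np.sin(k * X[:, 0]) for k in (1, 2, 3)])   # several target variables

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X, Y)                      # sklearn fits all targets jointly
print(model.predict([[12.0]]))       # outside the training range
Note that tree-based models predict (roughly) constant values outside the range of the training inputs, which is the most likely reason the extrapolation looks poor; a model with an explicit functional form for the curve, or a neural network, may extrapolate better.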

How to solve a classification problem where the dataset is small and the features of the two classes are similar/confusable

I am currently training a CNN to classify ICs into the classes "scratch" and "no scratch" (binary classification). I am fairly new to deep learning, and when I trained my CNN I got very good accuracies (good validation accuracy as well). But I quickly learned that my models were not as good as I thought, because when I used them on a held-out test dataset they produced quite a lot of misclassifications (false positives and false negatives). In my opinion there are 2 problems:
There is too little training data (about 1000 images per class).
The ICs have markings (text) on them, which change with every batch, so my training data has images of ICs with varying markings. And since some batches have more scratched ICs and others have fewer or none, the number of IC images with different markings is unbalanced.
Here are two example images of ICs from the training set of the class "scratch":
As you can see, the text varies a lot. Every line has different characters, and the number of characters also varies.
I ask myself how the CNN should be able to differentiate between a scratch and a character.
Nevertheless I am trying to train my CNN, and this is, for example, one model I am currently training (the other models look quite similar):
There are some points during training where the validation accuracy goes up and then down again. What could that mean? I think there may be a feature in the validation set that is not covered in my training set. Could this be the cause?
As you can see, data augmentation is not an option (or so I think) because of the text. One thing that came to my mind is to separate the marking and the IC (cut out the text region) in preprocessing (I don't know how to do that properly and fast) and then classify them separately, but I don't know if this would be the right approach.
I first used VGG16, ResNet and InceptionV3 (with transfer learning). Now I am trying to train my own custom CNN (inspired by VGG but with 10 layers, similar to this: https://journals.sagepub.com/doi/full/10.1177/1558925019897396).
Do you guys know how I should approach this problem, or do you have any tips?
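For reference, a minimal sketch of the transfer-learning setup described above (assuming 224x224 RGB crops of the ICs and a single scratch/no-scratch output; the head layers and sizes are illustrative):
# Sketch: VGG16 transfer learning for binary scratch / no-scratch classification.
from tensorflow import keras
from tensorflow.keras import layers

base = keras.applications.VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pretrained convolutional features

model = keras.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),  # scratch probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])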

Snorkel: Can I have different features in the dataset for generating labelling functions vs. training a classifier?

I have a set of features to build labelling functions (set A)
and another set of features to train a sklearn classifier (set B).
The generative model will output a set of probabilistic labels which I can use to train my classifier.
Do I need to add the features (set A) that I used for the labelling functions to my classifier features (set B)?
Or should I just use the generated labels to train my classifier?
I was referencing the Snorkel spam tutorial and I did not see them use the features from the labelling functions to train a new classifier.
As seen in cell 47, featurization is done entirely using a CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(df_train_filtered.text.tolist())
X_dev = vectorizer.transform(df_dev.text.tolist())
X_valid = vectorizer.transform(df_valid.text.tolist())
X_test = vectorizer.transform(df_test.text.tolist())
And then straight to fitting a keras model:
# Define a vanilla logistic regression model with Keras
keras_model = get_keras_logreg(input_dim=X_train.shape[1])
keras_model.fit(
    x=X_train,
    y=probs_train_filtered,
    validation_data=(X_valid, preds_to_probs(Y_valid, 2)),
    callbacks=[get_keras_early_stopping()],
    epochs=50,
    verbose=0,
)
I asked the same question on the Snorkel GitHub page and this is the response:
you do not need to add in the features (set A) that you used for LFs into the classifier features. In order to prevent the end model from simply overfitting to the labeling functions, it is better if the features for the LFs and end model (set A and set B) are as different as possible
https://github.com/snorkel-team/snorkel-tutorials/issues/193#issuecomment-576450705
From your linked Snorkel tutorial, the labeling functions (which map inputs to the labels "HAM", "SPAM", or "ABSTAIN") are used to provide labels rather than features.
IIUC, the idea is to generate labels when you do not have good-quality human labels. Though these "auto-generated" labels are quite noisy, they can serve as a starting point for a labeled dataset. The learning process takes this dataset and learns a model which encodes the knowledge embedded in the labeling functions. Hopefully the resulting model is more general and can be applied to unseen data.
If some of these labeling functions (which can be considered fixed rules) are very stable in prediction accuracy under certain conditions, then given enough training data your model should be able to learn that. However, in a production system, to guard against model instability, one easy fix is to override the machine prediction with human labels on data that has already been seen. The same idea applies if you think the labeling functions should handle certain input patterns directly: in that case the labeling functions provide labels that override the machine predictions. This can be implemented as a pre-check that runs before your machine-learned model, as in the sketch below.
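A minimal sketch of that pre-check idea (names are illustrative, not Snorkel API): apply the trusted labeling functions first and only fall back to the trained classifier when none of them fire.
ABSTAIN = -1

def precheck_predict(x, labeling_functions, model):
    # Stable, trusted rules take precedence over the learned model.
    for lf in labeling_functions:
        label = lf(x)
        if label != ABSTAIN:
            return label
    # Otherwise fall back to the classifier trained on the probabilistic labels.
    return model.predict([x])[0]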

Frechet Inception Distance for a DC-GAN trained on the MNIST dataset

I'm starting out with GANs and I am training a DC-GAN on the MNIST dataset. I want to evaluate my model using the Frechet Inception Distance (FID).
Since the Inception network is not trained to classify MNIST digits, can I use any simple MNIST classifier, or are there conditions on what kind of classifier I need to use? Or should I use the Inception net only? I have a few other questions:
Does it make sense to compute FID for a MNIST GAN?
How many images from the real dataset should be used when computing FID?
For the classifier I'm using, I'm getting FID values on the order of 10^6. Is that okay, or is something horribly wrong?
If you can answer any of these questions, even partially, that would be of immense help to me. Thanks!
You can refer to this:
Use an auto-encoder trained on MNIST and use the bottleneck activations as the features, as explained here.
Models trained on MNIST don't do well for FID computation. As far as I can tell, the major reasons are that the data distribution is too narrow (the GAN images are too far from the distribution the model was trained on) and that the model is not deep enough to learn much feature variation.
Training a model with a few convolutional layers gives FID values on the order of 10^6. To test the above hypothesis, just adding L2 regularization dropped the FID values to around 3k (consistent with the data distribution being narrow); however, the FID value doesn't improve as GAN training goes on. :(
Finally, directly computing FID values from the Inception network gives a nice plot as the images get better.
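For reference, once you have feature vectors for the real and generated images (from Inception or any other feature extractor), FID is the Fréchet distance between Gaussians fit to the two sets of features. A minimal sketch:
# Sketch: Frechet distance between feature statistics of real and generated images.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_fake):
    # Fit a Gaussian (mean, covariance) to each set of feature vectors.
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_fake, rowvar=False)
    # FID = ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 * sqrt(C1 @ C2))
    covmean = sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2.0 * covmean))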
[Note: You need to rescale the MNIST images and convert them to RGB by repeating the single channel 3 times. Make sure the real and generated images have the same intensity scale.]
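A minimal sketch of that preprocessing (assuming 28x28 grayscale MNIST arrays with pixel values in [0, 255] and the standard 299x299 InceptionV3 input; the exact scaling depends on which Inception preprocessing you use):
# Sketch: prepare MNIST-sized grayscale images for an ImageNet InceptionV3 feature extractor.
import tensorflow as tf

def mnist_to_inception_input(images):
    # images: (N, 28, 28) grayscale with pixel values in [0, 255]
    x = tf.convert_to_tensor(images, dtype=tf.float32)[..., tf.newaxis]  # (N, 28, 28, 1)
    x = tf.image.resize(x, (299, 299))  # rescale to Inception's input size
    x = tf.repeat(x, 3, axis=-1)        # repeat the single channel 3 times -> RGB
    # Apply the same preprocessing to real and generated images so intensity scales match.
    return tf.keras.applications.inception_v3.preprocess_input(x)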

What does it mean to train an SVM?

I am new to image processing. For my project I am building an "image classifier using SVM". The idea of my final software: I select an image and give it as input, and the software classifies it; if I give it an image of an animal, it classifies it as a cat or a snake accordingly.
When I google about it, it says "First you need to train the SVM".
What does it mean to train an SVM?
What is the actual input to the SVM in my case (image classification)?
SVM is just a classifier, so how does it classify images? Is it necessary to convert the image to any particular format? Please help.
A Support Vector Machine (SVM) is a machine learning model for supervised data classification. SVMs essentially learn a hyper-plane which separates the data space into 2 regions (in the 2-class case). In your case, suppose you have images of snakes and cats and you need to classify them. The steps you'll need to follow are:
Extract 'features' from the images.
These 'features' may be functions of the appearance of the snake/cat, e.g. the colour of the animal, the shape of the animal, etc. By concatenating these features you get a multi-dimensional feature vector.
Train an SVM classifier.
Training essentially learns a separating hyper-plane between the feature vectors of the snake class and the cat class. For example, if your feature vector is 2-dimensional, training an SVM classifier amounts to 'learning' a line which best separates your labeled training data.
You could use any of the multitude of freely available machine learning libraries; if you speak Python, you could use sklearn for the task.
This task of learning (a hyper-plane, in the linear SVM case) is what is referred to as training.
Classify the images.
Once you have trained your model, you can use it to classify images whose class is not known.
Note: I am simplifying a lot of the details/issues involved in this answer. I suggest you read up on SVMs.
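In Python, a minimal sketch of these steps with sklearn (using a simple colour-histogram feature purely for illustration; a real system would use stronger features such as HOG or CNN embeddings):
# Sketch: feature extraction + SVM training for cat vs. snake images.
import numpy as np
from sklearn.svm import SVC

def colour_histogram(image, bins=8):
    # image: (H, W, 3) uint8 RGB array; returns a flat, normalized colour histogram.
    hist, _ = np.histogramdd(image.reshape(-1, 3), bins=(bins,) * 3, range=[(0, 256)] * 3)
    return hist.ravel() / hist.sum()

# train_images / train_labels stand in for your own labeled data (0 = cat, 1 = snake).
features = np.array([colour_histogram(img) for img in train_images])
clf = SVC(kernel="rbf")      # "training the SVM" = learning the separating boundary
clf.fit(features, train_labels)

# Classifying a new, unlabeled image:
prediction = clf.predict([colour_histogram(new_image)])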
