For my master's thesis I'm developing a system to classify and extract cybersecurity countermeasures from unstructured texts.
In my binary classifier I want to check whether a text is relevant or not. For this purpose I tried two approaches:
Scikit-Learn Support Vector Machines:
I used the paper by Husari et al. as a guide: https://www.researchgate.net/publication/321503662_TTPDrill_Automatic_and_Accurate_Extraction_of_Threat_Actions_from_Unstructured_Text_of_CTI_Sources
They used three features for their SVM classifier.
My question: How can I add features to an SVM classifier?
BERT with PyTorch:
I created a dataset of manually labeled texts (100 in total; 30 relevant, 70 not relevant).
The output of 70 % accuracy and 61 % loss does not seem good enough.
I think it's because of the small dataset.
My question: Is there another way to use BERT with small datasets to get more accurate results?
I have a large training set of words labeled pos and neg to classify texts. I used TextBlob (according to this tutorial) to classify texts. While it works fairly well, it can be very slow for a large training set (e.g. 8k words).
I would like to try doing this with scikit-learn but I'm not sure where to start. What would the above tutorial look like in scikit-learn? I'd also like the training set to include weights for certain words: some should pretty much guarantee that a particular text is classed as "positive", while others should guarantee that it's classed as "negative". And lastly, is there a way to indicate that certain parts of the analyzed text are more valuable than others?
Any pointers to existing tutorials or docs appreciated!
There is an excellent chapter on this topic in Sebastian Raschka's Python Machine Learning book and the code can be found here: https://github.com/rasbt/python-machine-learning-book/blob/master/code/ch08/ch08.ipynb.
He does sentiment analysis (what you are trying to do) on an IMDB dataset. His data is not as clean as yours - from the looks of it - so he needs to do a bit more pre-processing work. Your problem can be solved with these steps:
Create numerical features by vectorizing your text: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html
Train test split: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
Train and test your favourite model, e.g.: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
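The three steps above can be sketched as a single scikit-learn pipeline. This is a minimal sketch, not the code from the book: the toy texts, the labels, and the parameter values (n_features, test_size, max_iter) are assumptions chosen only to make the example run.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; replace with your own labeled texts.
texts = [
    "what a great and wonderful day",
    "I love this, it is excellent",
    "really good and pleasant experience",
    "fantastic, I am very happy",
    "this is terrible and awful",
    "I hate it, very bad",
    "horrible, a complete disappointment",
    "the worst thing I have seen",
]
labels = ["pos", "pos", "pos", "pos", "neg", "neg", "neg", "neg"]

# Step 2: hold out part of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Steps 1 and 3: vectorize the text, then fit a linear classifier.
model = make_pipeline(
    HashingVectorizer(n_features=2**16, alternate_sign=False),
    LogisticRegression(max_iter=1000),
)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
prediction = model.predict(["what an excellent day"])[0]
print(f"test accuracy: {accuracy:.2f}, prediction: {prediction}")
```

Because HashingVectorizer is stateless, it does not need to hold a vocabulary in memory, which may help with larger training sets.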
There are many ways to do this, such as TF-IDF (Term Frequency - Inverse Document Frequency), Count Vectorizer, Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Word2Vec.
Of the methods mentioned above, I consider Word2Vec the best. You can use a pre-trained Word2Vec model by Google, available at:
https://github.com/mmihaltz/word2vec-GoogleNews-vectors
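For comparison, here is a minimal sketch of three of the simpler options from the list (Count Vectorizer, TF-IDF, and LSA) using scikit-learn. Word2Vec is left out because it needs the pre-trained model download. The example documents and the choice of two LSA components are assumptions for illustration only.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical example documents.
docs = [
    "the movie was great and the actors were great",
    "the movie was boring",
    "the stock market fell sharply today",
    "investors sold stock as the market dropped",
]

# Count Vectorizer: raw term counts per document.
counts = CountVectorizer().fit_transform(docs)

# TF-IDF: counts re-weighted so that terms common to all documents matter less.
tfidf = TfidfVectorizer().fit_transform(docs)

# LSA: truncated SVD applied to the TF-IDF matrix gives dense, low-dimensional vectors.
lsa = TruncatedSVD(n_components=2, random_state=0).fit_transform(tfidf)

print(counts.shape, tfidf.shape, lsa.shape)
```

Any of these matrices can be passed to a scikit-learn classifier as the feature input.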
I have a dataset that consists of sentences and corresponding multi-labels (i.e. a sentence can belong to multiple labels). Using a combination of Convolutional Neural Networks and Recurrent Neural Nets on language models (Word2Vec) I'm able to achieve good accuracy. However, it's /too/ good at modelling the output, in the sense that a lot of labels are arguably wrong and thus the output is too. This means that the evaluation (even with regularization and dropout) gives a wrong impression, since I have no ground truth. Cleaning up the labels would be prohibitively expensive. So I'm left to explore "denoising" the labels somehow. I've looked at things like "Learning from Massive Noisy Labeled Data for Image Classification"; however, they assume learning some sort of noise covariance matrix on the outputs, which I'm not sure how to do in Keras.
Has anyone dealt with the problem of noisy labels in a multi-label text classification setting before (ideally using Keras or similar) and has good ideas on how to learn a robust model with noisy labels?
The cleanlab Python package (pip install cleanlab), of which I am an author, was designed to solve this task: https://github.com/cleanlab/cleanlab/. It's a professional package created for finding label errors in datasets and learning with noisy labels. It works with any scikit-learn model out-of-the-box and can be used with PyTorch, FastText, Tensorflow, etc.
(UPDATED Sep 2022) I've added resources for exactly this task (text classification with noisy labels, i.e. labels that are sometimes flipped to other classes):
Blog: https://cleanlab.ai/blog/label-errors-text-datasets/
Runnable Colab Notebook: https://docs.cleanlab.ai/stable/tutorials/text.html
Example -- Find label errors in your dataset.
from cleanlab.classification import CleanLearning
from cleanlab.filter import find_label_issues
from cleanlab.count import estimate_cv_predicted_probabilities
# OPTION 1 - 1 line of code for sklearn compatible models
issues = CleanLearning(sklearnModel, seed=SEED).find_label_issues(data, labels)
# OPTION 2 - 2 lines of code to use ANY model
# just pass in out-of-sample predicted probabilities
pred_probs = estimate_cv_predicted_probabilities(data, labels)
ordered_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    return_indices_ranked_by='self_confidence',
)
Details on how to compute out-of-sample predicted probabilities with any model here.
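As a rough illustration of what "out-of-sample predicted probabilities" means, the sketch below uses scikit-learn's cross_val_predict so that each example is scored by a model that never saw it during training. This is a generic scikit-learn sketch, not code from the cleanlab documentation; the synthetic dataset and the LogisticRegression model are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Synthetic stand-in for real features and (possibly noisy) integer labels.
data, labels = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Each row of pred_probs comes from a fold in which that example was held out.
pred_probs = cross_val_predict(
    LogisticRegression(max_iter=1000),
    data,
    labels,
    cv=5,
    method="predict_proba",
)

print(pred_probs.shape)  # one row per example, one column per class
```

The resulting array has the shape that find_label_issues expects for pred_probs: one row per example and one column per class.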
Example -- Learning with Noisy Labels
Train an ML model on noisy labels like it was trained on perfect labels.
# Code taken from https://github.com/cleanlab/cleanlab
from sklearn.linear_model import LogisticRegression
# Learning with noisy labels in 3 lines of code.
cl = CleanLearning(clf=LogisticRegression()) # any sklearn-compatible classifier
cl.fit(X=train_data, labels=labels)
# Estimate the predictions you would have gotten training with error-free labels.
predictions = cl.predict(test_data)
Given that you also may be working with image classification and audio classification, here are working examples for Image Classification with PyTorch and Audio Classification with SpeechBrain.
Additional documentation is available here: docs.cleanlab.ai
The data that I've got are mostly tweets or short comments (300-400 chars). I used a Bag-of-Words model with Naive Bayes classification. Now I'm getting a lot of misclassified cases of the type shown below:
1.] He sucked on a lemon early morning to get rid of hangover.
2.] That movie sucked big time.
Now the problem is that during sentiment classification both are getting "Negative" just because of the word "sucked".
Sentiment Classification : 1.] Negative 2.] Negative
Similarly, during document classification both are getting classified into "movies" due to the presence of the word "sucked".
Document classification : 1.] Movie 2.] Movie
This is just one such instance. I'm facing a huge number of wrong classifications and don't have any idea how to improve the accuracy.
(1)
One straightforward possible change from Bag-of-Words with Naive Bayes is to generate polynomial combinations of Bag-of-Words features. It might solve the problems you have shown above.
"sucked" + "lemon" (positive)
"sucked" + "movie" (negative)
Of course, you can also generate polynomial combinations of n-grams but the number of features might be too large.
The scikit-learn library provides a preprocessing class for this purpose.
sklearn.preprocessing.PolynomialFeatures (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html)
Theoretically, an SVM with a polynomial kernel does the same thing as PolynomialFeatures + linear SVM; the two differ slightly in how the model information is stored.
In my experience, PolynomialFeatures + linear SVM performs reasonably well for short text classification including sentiment analysis.
If the dataset is not large enough, the training data might not contain "sucked" + "lemon". In that case, dimensionality reduction such as Singular Value Decomposition (SVD) and topic models such as Latent Dirichlet Allocation (LDA) are suitable tools for finding semantic clusters of words.
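The PolynomialFeatures + linear SVM idea from point (1) can be sketched as follows. The two example sentences come from the question; the remaining texts, the labels, and the parameter choices (binary counts, degree 2, interaction_only) are assumptions made for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import LinearSVC

# Hypothetical toy data; labels follow the "positive"/"negative" notation above.
texts = [
    "He sucked on a lemon early morning to get rid of hangover.",
    "She sucked on a lemon slice after dinner.",
    "That movie sucked big time.",
    "The new movie sucked, do not watch it.",
]
labels = ["positive", "positive", "negative", "negative"]

pipeline = make_pipeline(
    # Binary Bag-of-Words: 1 if the word occurs in the text, else 0.
    CountVectorizer(binary=True),
    # Add one feature per word pair (e.g. "sucked" AND "lemon").
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearSVC(),
)
pipeline.fit(texts, labels)

vocab_size = len(pipeline.named_steps["countvectorizer"].vocabulary_)
n_features = pipeline.named_steps["polynomialfeatures"].n_output_features_
predictions = pipeline.predict(texts)
print(vocab_size, n_features, list(predictions))
```

The feature count grows quadratically with the vocabulary size, which is the reason the answer warns about combining n-grams this way.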
(2)
Another direction is to use more sophisticated natural language processing (NLP) techniques to extract additional information from short texts. For example, Part-of-Speech (POS) tagging and Named Entity Recognition (NER) will give more information than plain BoW features. A Python library for NLP called Natural Language Toolkit (NLTK) implements those functions.
(3)
You can also take the slow but steady way. Analyzing the current model's prediction errors in order to design new hand-crafted features is a promising way to improve the accuracy of the model.
There is a library for short-text classification called LibShortText, which also contains an error analysis function and preprocessing functions such as TF-IDF weighting. It might help you to learn how to improve the model via error analysis.
LibShortText (https://www.csie.ntu.edu.tw/~cjlin/libshorttext/)
(4)
For further and more advanced information, take a look at the literature on sentiment analysis of tweets.
Maybe you could try to use a more powerful classifier like Support Vector Machines. Also depending on the amount of data you have, you could try deep learning with convolutional neural nets. For this you will need a huge number of training examples (100k-1M).
I am trying to identify phonemes in voices using a training database of known ones.
I'm wondering if there is a way of identifying common features within my training sample and using that to classify a new one.
It seems like there are two paths:
Give the process raw/normalised data and it will return similar ones
Extract certain metrics such as pitch, formants etc and compare to training set
My interest is the first!
Any recommendations on machine learning or regression methods/algorithms?
Since you tagged Python, I highly recommend looking into scikit-learn, an excellent Python library for Machine Learning. Their docs are very thorough, and should give you a good crash course in Machine Learning algorithms and implementation (including classification, regression, clustering, etc.).
Your points 1 and 2 are not very different: 1) is the end result of a classification problem, while 2) describes the features that you give to the classifier. What you need is a good classifier (SVM, decision trees, hierarchical classifiers etc.) and a good set of features (pitch, formants etc. that you mentioned).
Currently I am working on a project to classify a given set of test images into one of 5 predefined categories. I implemented Logistic Regression with a feature vector of 240 features for each image and trained it using 100 images per category. The training accuracy I achieved was ~98% for each category, whereas when tested on a validation set consisting of 500 images (100 images per category), only ~57% of the images were correctly classified.
Please suggest a few libraries/tools which I can use (preferably based on Neural Networks) in order to attain higher accuracy.
I tried using a Java-based tool, Neuroph (neuroph.sourceforge.net), on Windows, but it didn't run as expected.
Edit: The feature vectors were already provided for the project. I am also looking for a better feature extraction tool for images.
You can get help from this paper: Image Classification.
In my opinion, SVM is relatively better than logistic regression when it comes to multi-class response problems. We use it in e-commerce product classification, where there are thousands of response levels and thousands of features.
Based on your tags I assume you would like a Python package. scikit-learn has good classification routines: scikit-learn.org.
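A minimal sketch of what trying an SVM in scikit-learn could look like for this setup. The data here is synthetic (500 samples, 240 features, 5 classes) and only mimics the shape described in the question; the RBF kernel, C=1.0, and 5-fold cross-validation are assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the provided 240-dimensional image feature vectors.
X, y = make_classification(
    n_samples=500, n_features=240, n_informative=20, n_classes=5, random_state=0
)

# Scale the features, then fit an SVM; scaling usually matters for SVMs.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# Cross-validated accuracy is a more honest estimate than training accuracy.
scores = cross_val_score(model, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.2f}")
```

Comparing the cross-validated score with the training accuracy shows how much the model overfits, which is the gap (98% vs. 57%) described in the question.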
I have had good success using the WEKA tools. You need to isolate the feature set that you are interested in and then apply a classifier from this library. The examples are very clear. http://weka.wikispaces.com