After doing a lot of research on AI and sentiment analysis, I found two ways to do text analysis.
Once the text has been pre-processed, we must build a classifier to separate positive from negative text. My question is which of these is better, for example:
First way:
100 text records for training, each with 2 fields: the text and
a status field that indicates whether it is positive (1) or negative (0).
Second way:
100 text records for training, from which I build a bag-of-words vocabulary, then train and compare the test records against this bag of words.
If I am mistaken in my question, please tell me and correct it.
I think you might be missing something here. To train a sentiment analysis model, you need training data in which every row has a label (positive or negative) and a raw text. Since a computer cannot understand text directly, the text has to be represented as numbers, and one way to do that is a bag of words (there are other methods, such as TF-IDF, Word2Vec, etc.). So when you train the model on the training data, the program should preprocess the raw text and then build (in this case) a bag-of-words map, where every element position represents one vocabulary word; its value is the number of times the word occurs in the text (1 or more), or 0 if the word does not occur.
Now suppose training has finished and the program has produced a model. This model is what you save, so whenever you want to test data you don't need to train again. And yes, when you test, you use the bag-of-words mapping from the training data. If a word in the test dataset never occurred in the training dataset, just map it to 0.
In short:
When you want to test, you have to use the bag-of-words mapping built from the training data.
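As a rough illustration of that idea, here is a minimal pure-Python sketch. The helper names (`tokenize`, `build_vocabulary`, `vectorize`) and the toy sentences are made up for this example; a library vectorizer would normally replace them.

```python
from collections import Counter


def tokenize(text):
    # Deliberately simple: lowercase and split on whitespace.
    return text.lower().split()


def build_vocabulary(train_texts):
    # Map each distinct training word to a fixed column index.
    words = sorted({w for t in train_texts for w in tokenize(t)})
    return {w: i for i, w in enumerate(words)}


def vectorize(text, vocabulary):
    # Count vector over the training vocabulary.
    # Words never seen during training have no column, so they are ignored.
    vec = [0] * len(vocabulary)
    for word, count in Counter(tokenize(text)).items():
        if word in vocabulary:
            vec[vocabulary[word]] = count
    return vec


# Hypothetical labelled training data: 1 = positive, 0 = negative.
train_texts = ["good good movie", "bad movie"]
train_labels = [1, 0]

vocabulary = build_vocabulary(train_texts)
train_vectors = [vectorize(t, vocabulary) for t in train_texts]

# The test text reuses the training vocabulary; "plot" is unseen and dropped.
test_vector = vectorize("good plot", vocabulary)
```

The labels stay alongside the vectors: the bag of words only replaces the raw text, it does not replace the positive/negative field.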
I have a list of Twitter users (screen_names) and I need to categorise them into 7 pre-defined categories - Education, Art, Sports, Business, Politics, Automobiles, Technology - based on their interest area.
I have extracted last 100 tweets of the users in Python and created a corpus for each user after cleaning the tweets.
As mentioned here Tweet classification into multiple categories on (Unsupervised data/tweets) :
I am trying to generate dictionaries of common words under each category so that I can use it for classification.
Is there a method to generate these dictionaries for a custom set of words automatically?
Then I can use these for classifying the twitter data using a tf-idf classifier and get the degree of correspondence of the tweet to each of the categories. The highest value will give us the most probable category of the tweet.
But since the categorisation is based on these pre-generated dictionaries, I am looking for a way to generate them automatically for a custom list of categories.
Sample dictionaries :
Education - ['book','teacher','student'....]
Automobiles - ['car','auto','expo',....]
Example I/O:
**Input :**
UserA - "students visited share learning experience eye opening
article important preserve linaugural workshop students teachers
others know coding like know alphabets vision driving codeindia office
initiative get students tagging wrong people apologies apologies real
people work..."
.
.
UserN - <another corpus of cleaned tweets>
**Expected output** :
UserA - Education (61%)
UserN - Automobiles (43%)
TL;DR
Labels are necessary for supervised machine learning. And if you don't have training data that contains Xs (input texts) and Y (output labels) then (i) supervised learning might not be what you're looking for or (ii) you have to create a dataset with texts and their corresponding labels.
In Long
Let's try to break it down and reflect on what you're looking for.
I have a list twitter users (screen_names) and I need to categorise them into 7 pre-defined categories - Education, Art, Sports, Business, Politics, Automobiles, Technology
So your ultimate task is to label tweets into 7 categories.
I have extracted last 100 tweets of the users in Python and created a corpus for each user after cleaning the tweets.
100 data points is definitely insufficient to do anything if you want to train a supervised machine learning model from scratch.
Another thing is the definition of corpus. A corpus is a body of text so it's not wrong to call any list of strings a corpus. However, to do any supervised training, each text should come with the corresponding label(s)
But I see some people do unsupervised classification without any labels!
Now, that's an oxymoron =)
Unsupervised Classification
Yes, there is "unsupervised learning", which often means learning a representation of the inputs. Generally, the representation of the inputs is used to (i) generate or (ii) sample.
Generation from a representation means creating, from the representation, a data point that is similar to the data the unsupervised model has learnt from. In the case of text processing / NLP, this often means generating new sentences from scratch, e.g. https://transformer.huggingface.co/
Sampling a representation means giving the unsupervised model a text and expecting it to provide some signal based on what it has learnt. E.g. given a language model and a novel sentence, we estimate the probability of the sentence, then compare this probability with the probabilities of other sentences.
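To make the "probability of a sentence" idea concrete, here is a small sketch of a bigram language model with add-one smoothing. It is only a toy stand-in for a real language model; the function names and the two training sentences are assumptions made for this example.

```python
import math
from collections import Counter


def train_bigram_model(sentences):
    # Learn bigram counts from unlabelled sentences (no labels involved).
    contexts, bigrams, words = Counter(), Counter(), set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        words.update(tokens[1:])
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    # +1 reserves probability mass for words never seen in training.
    return contexts, bigrams, len(words) + 1


def log_probability(sentence, model):
    # Sum of add-one smoothed log P(next word | previous word).
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


model = train_bigram_model(["students like coding", "teachers like students"])

# A word order seen in training should score higher than a scrambled one.
fluent = log_probability("students like coding", model)
scrambled = log_probability("coding students like", model)
```

Note that the model gives a score for a sentence, not a category: nothing here says which of the 7 labels a tweet belongs to.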
Algorithmia has a nice summary blogpost https://algorithmia.com/blog/introduction-to-unsupervised-learning and a more modern perspective https://sites.google.com/view/berkeley-cs294-158-sp20/home
That's a whole lot of information but you don't tell me how to #$%^&-ing do unsupervised classification!
Yes, the oxymoron explanation isn't finished. If we look at text classification, what are we exactly doing?
We are fitting the input text into some pre-defined categories. In your case, the labels are pre-defined but
Q: Where exactly would the signal come from?
A: From the tweets, of course, stop distracting me! Tell me how to do classification!!!
Q: How do you tell the model that a tweet should be this label and not another label?
A: From the unsupervised learning, right? Isn't that what unsupervised learning supposed to do? To map the input texts to the output labels?
Precisely, that's the oxymoron,
It is supervised learning, not unsupervised learning, that maps the input texts to output labels.
So what do I do? I need to use unsupervised learning and I want to do classification.
Then the question to ask is:
Do you have labelled data?
If no, then how to get labels?
Use proxies: find signals that tell you a certain tweet has a certain label, e.g. from the hashtags, or assume that certain people always tweet about a certain category
Use existing tweet classifiers to label your data and then train the classification model on the data
Do I have to pay for these classifiers? Most often, yes you do. https://english.api.rakuten.net/search/text%20classification
If yes, then how much?
If it's too little,
then how to create more? Maybe https://machinelearningmastery.com/a-gentle-introduction-to-the-bootstrap-method/
or maybe use some modern post-training algorithm https://towardsdatascience.com/https-medium-com-chaturangarajapakshe-text-classification-with-transformer-models-d370944b50ca
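As one possible reading of the "use proxies" option above, the sketch below derives labels from hashtags. The `HASHTAG_TO_CATEGORY` mapping and the sample tweets are hypothetical, and the function has to run on the raw tweets, before any cleaning step strips the hashtags.

```python
import re

# Hypothetical, hand-curated mapping from hashtags to the target categories.
HASHTAG_TO_CATEGORY = {
    "education": "Education",
    "edtech": "Education",
    "cars": "Automobiles",
    "autoexpo": "Automobiles",
}


def proxy_label(raw_tweet):
    # Collect the categories implied by the tweet's hashtags.
    tags = [t.lower() for t in re.findall(r"#(\w+)", raw_tweet)]
    categories = {HASHTAG_TO_CATEGORY[t] for t in tags if t in HASHTAG_TO_CATEGORY}
    # Only trust unambiguous tweets; everything else stays unlabelled.
    return categories.pop() if len(categories) == 1 else None


raw_tweets = [
    "Great workshop for students today #Education",
    "New model unveiled #AutoExpo #cars",
    "Nice weather this morning",
    "Teaching kids to fix engines #education #cars",
]

# Keep only the tweets that received a proxy label: (text, label) pairs.
labelled = [(t, proxy_label(t)) for t in raw_tweets if proxy_label(t) is not None]
```

The resulting pairs are noisy labels, so they are a starting point for a supervised dataset rather than ground truth.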
How about all these AI I keep hearing about, that I can do classification with 3 lines of code.
Don't they use unsupervised language models that sounds like Sesame Street characters, e.g. ELMO, BERT, ERNIE?
I guess you mean something like https://github.com/ThilinaRajapakse/simpletransformers#text-classification
from simpletransformers.classification import ClassificationModel
import pandas as pd
# Train and Evaluation data needs to be in a Pandas Dataframe of two columns. The first column is the text with type str, and the second column is the label with type int.
train_data = [['Example sentence belonging to class 1', 1], ['Example sentence belonging to class 0', 0]]
train_df = pd.DataFrame(train_data)
eval_data = [['Example eval sentence belonging to class 1', 1], ['Example eval sentence belonging to class 0', 0]]
eval_df = pd.DataFrame(eval_data)
# Create a ClassificationModel
model = ClassificationModel('bert', 'bert-base-cased') # 'bert-base' alone is not a published model name. You can set class weights by using the optional weight argument
# Train the model
model.train_model(train_df)
Take careful notice of the comment:
Train and Evaluation data needs to be in a Pandas Dataframe of two columns. The first column is the text with type str, and the second column is the label with type int.
Yes that's the more modern approach to:
First use a pre-trained language model to convert your texts into input representations
Then feed the input representations and their corresponding labels to a classifier
Note, you still can't avoid the fact that you need labels to train the supervised classifier
Wait a minute, you mean all these AI I keep hearing about is not "unsupervised classification".
Exactly. There's really no such thing as "unsupervised classification" (yet); somehow (i) the labels need to be manually defined, and (ii) the mapping between the inputs and the labels should exist
The right word to define the paradigm would be transfer learning, where the language is
learned in a self-supervised manner (it's actually not truly unsupervised) so that the model learns to convert any text into some numerical representation
then use the numerical representation with labelled data to produce the classifier.
I have data with 2 important columns, Product Name and Product Category. I wanted to classify a search term into a category. The approach (in Python using Sklearn & DaskML) to create a classifier was:
Clean Product Name column for stopwords, numbers, etc.
Create 90% 10% train-test split
Convert text to vector using OneHotEncoder
Create classifier (Naive Bayes) on the training data
Test the classifier
I realized the OneHotEncoder (or any encoder) converts the text to numbers by creating a matrix that takes into account where and how many times a word occurs.
Q1. Do I need to convert from Word to Vectors before train-test split or after train-test split?
Q2. When I will search for new words (which may not be in the text already), how will I classify it because if I encode the search term, it will be irrelevant to the encoder used for the training data. Can anybody help me with the approach so that I can classify a search term into a category if the term doesn't exist in the training data?
Q1. Do I need to convert from Words to Vectors before train-test split?
Answer: Every algorithm takes some numeric representation of its inputs, so you have to convert words to vectors; there is no alternative to this. Apart from OneHotEncoder, there are other approaches such as CountVectorizer and TfidfVectorizer, which are recommended over one-hot encoding. You can read more about them here.
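On the before-or-after part of Q1, a common practice is to split first and fit the vectorizer on the training split only, so that nothing from the test set leaks into the vocabulary or the weights. The sketch below illustrates this with a hand-rolled TF-IDF; `split_train_test`, `TfidfSketch`, the product rows and the smoothed idf formula are assumptions for this example, not the asker's code. It also touches on Q2: a search term made only of unseen words becomes an all-zero vector.

```python
import math
import random
from collections import Counter


def split_train_test(rows, test_fraction=0.1, seed=0):
    # Shuffle a copy and hold out a fraction of the rows for testing.
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


class TfidfSketch:
    def fit(self, texts):
        # Document frequencies and vocabulary come from training texts only.
        n = len(texts)
        df = Counter(w for t in texts for w in set(t.lower().split()))
        self.vocabulary = {w: i for i, w in enumerate(sorted(df))}
        self.idf = {w: math.log((1 + n) / (1 + c)) + 1 for w, c in df.items()}
        return self

    def transform(self, text):
        # Term frequency times idf; unseen words are skipped.
        vec = [0.0] * len(self.vocabulary)
        for w, tf in Counter(text.lower().split()).items():
            if w in self.vocabulary:
                vec[self.vocabulary[w]] = tf * self.idf[w]
        return vec


# Hypothetical (product name, category) rows.
rows = [
    ("red cotton shirt", "Clothing"), ("blue denim jeans", "Clothing"),
    ("wool winter coat", "Clothing"), ("leather office chair", "Furniture"),
    ("oak dining table", "Furniture"), ("steel bed frame", "Furniture"),
    ("wireless optical mouse", "Electronics"), ("usb keyboard", "Electronics"),
    ("led desk lamp", "Electronics"), ("noise cancelling headphones", "Electronics"),
]

train_rows, test_rows = split_train_test(rows)              # 1. split
vectorizer = TfidfSketch().fit([t for t, _ in train_rows])  # 2. fit on train
test_vector = vectorizer.transform(test_rows[0][0])         # 3. transform test
unseen_query = vectorizer.transform("zzz qqq")              # all zeros
```

An all-zero vector carries no evidence, so a classifier can only fall back on its prior for such a search term; covering unseen words needs a different representation, such as pre-trained word embeddings.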
What are the important steps for preprocessing Twitter texts in order to classify them into two classes? What I did is strip the hashtag sign and keep the word itself; I also used a regular expression to remove special characters. These are the two functions I used.
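import re

def removeusername(tweet):
    # Split on '#' and '_' and rejoin the pieces with single spaces.
    return " ".join(word.strip() for word in re.split('#|_', tweet))

def removingSpecialchar(text):
    # Replace hashtags, special characters and URLs with spaces, then collapse whitespace.
    return ' '.join(re.sub(r"(#[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)", " ", text).split())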
What other things can I do to preprocess text data? I have also used the NLTK stopword corpus to remove all stop words from the tokenized words.
I used the NaiveBayes classifier in TextBlob to train on the data, and I am getting 94% accuracy on the training data and 82% on the testing data. I want to know whether there is any other method to get good accuracy. By the way, I am new to the machine learning field and have a limited idea about all of it!
Well, then you can start by playing with the size of your vocabulary. You might exclude some words that are too frequent in your data (without being considered stop words), and do the same with words that appear in only one tweet (misspelled words, for example). Sklearn's CountVectorizer lets you do this easily; have a look at the min_df and max_df parameters.
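To show what such pruning does, here is a plain-Python sketch of document-frequency filtering. It loosely mirrors the idea behind min_df (read here as an absolute count) and max_df (read here as a proportion of documents); the `prune_vocabulary` helper and the sample tweets are invented for illustration.

```python
from collections import Counter


def prune_vocabulary(tweets, min_df=2, max_df=0.9):
    # Document frequency: in how many tweets does each word appear?
    n = len(tweets)
    df = Counter(w for t in tweets for w in set(t.lower().split()))
    # Keep words that are neither too rare nor too common.
    return sorted(w for w, c in df.items() if c >= min_df and c / n <= max_df)


tweets = [
    "the match was great",
    "the team played great",
    "the coach was happy",
    "the gaem was long",   # "gaem" is a misspelling that occurs once
    "the match was long",
]

kept = prune_vocabulary(tweets)
```

Here "the" is dropped because it occurs in every tweet, and one-off words such as the misspelled "gaem" are dropped because they occur in a single tweet.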
Since you are working with tweets, you can also think about URL strings. Try to obtain some valuable information from links; there are lots of options, from simple regular expressions that retrieve the domain name of the page to more complex NLP-based methods that study the link content. Once more, it's up to you!
I would also have a look at pronouns if you are lemmatizing with spaCy, since its 2.x lemmatizer replaces all of them with the keyword -PRON- by default. This is a classic solution that simplifies things but might result in a loss of information.
For preprocessing raw data, you can try:
Stop word removal.
Stemming or Lemmatization.
Exclude terms that are either too common or too rare.
Then a second preprocessing step is possible:
Construct a TFIDF matrix.
Construct or load pretrained wordEmbedding (Word2Vec, Fasttext, ...).
Then you can feed the result of the second step into your model.
These are just the most common methods; many others exist.
I will let you check each of these methods yourself, but it is a good base.
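The first-step items above can be sketched in a few lines. The stop-word list and the suffix-stripping `crude_stem` below are deliberately tiny stand-ins chosen for this example; a real pipeline would more likely use NLTK's stopword corpus and a proper stemmer such as NLTK's PorterStemmer.

```python
import re

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "with", "and", "to", "of"}
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word):
    # Strip one common suffix if at least 3 characters remain.
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    # Lowercase, keep alphanumeric tokens, drop stop words, then stem.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The students are visiting the workshops")
```

The output tokens can then go into the second step, for example a TF-IDF matrix or an embedding lookup.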
There are no compulsory steps. For example, it is very common to remove stop words (also called function words) such as "yes", "no", "with". But in one of my pipelines I skipped this step and the accuracy did not change. NLP is an experimental field, so the most important advice is to build a pipeline that runs as quickly as possible, define your goal, and train with different parameters.
Before you move on, you need to make sure your training set is sound. What are you training for? Is your set clean (e.g. does the positive class contain only positives)? How do you define accuracy, and why?
Now, the situation you described seems like a case of over-fitting. Why? because you get 94% accuracy on the training set, but only 82% on the test set.
This problem happens when you have a lot of features but relatively small training dataset - so the model is fitted best for the specific train set but fails to generalize.
Now, you did not specify how large your dataset is, so I'm guessing between 50 and 500 tweets, which is too small given an English vocabulary of some 200k words or more. I would try one of the following options:
(1) Get more training data (at least 2000)
(2) Reduce the number of features; for example, you can remove uncommon words and names, i.e. any word that appears only a small number of times
(3) Use a better classifier (Bayes is rather weak for NLP). Try SVM or deep learning.
(4) Try regularization techniques
I'm new to machine learning algorithms. I have read the scikit-learn website and other SO posts extensively, which led me to build my first machine learning models using RandomForestClassifier and LinearSVC.
I'm working on medical notes. Each patient stay is associated (or not) with a code corresponding to a complication (bleeding, infection, heart attack...).
Using the notes, fitted and transformed with CountVectorizer and TfidfTransformer, I can accurately predict most of the codes. However, I'd like to add more data to my training dataset: length of stay, number of operations, title of operations, ICU stay duration, etc.
After searching the web and SO, I ended up adding all the continuous/binary/scaled values to my word frequency array.
e.g.: [0,0,0.34,0,0.45,0, 2, 45] (the last 2 numbers are the added data, whereas the previous ones come from CountVectorizer and tfidf.fit_transform(train_set))
However, this seems to me a crude way to combine data, and a huge number of words could mask the other data.
I tried to arrange my data like [[0,0,0.34,0,0.45,0],[2],[45]], but it doesn't work.
I searched the web but found no real clue, even though I am probably not the first one facing this issue... :p
Thanks for your help
Edit:
Thanks for your detailed, valuable answer; I really appreciate it. However, what exactly is the 0-1 range: is it the predict_proba value (http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier.predict)? I understood that the score is the accuracy of the prediction model. Then, once you have all the predictions for each variable, do you average them? Finally, I'm working with multiple outputs. I guess that's not a problem, since I can get a prediction for each output (by the way, predict_proba(X) gives me an array like [array([[0.,1.]]), array([[0.2,0.8]]), ...] with a random forest classifier. I guess one of the numbers is the probability of the output, but I haven't explored this yet!)
Your first solution of just appending to the list is the correct solution. However, you should think about what this implies. If you have 100 words and add two additional features, each individual word will get the same "weight" as each added feature, i.e. your added features won't be treated very strongly in the model. Additionally, you're saying that the last feature, with a value of 45, is 100x the value of the feature 4th from the end (0.45).
One common way to get around that is to use an ensemble model. Instead of adding those features to your list of words and predicting, first build a prediction model just using the words. That prediction will be in the range 0-1 and will capture the "sentiment" of the article. Then, scale your other variables (minmax scaler, normal distribution, etc.). Finally, combine the score from the words with the last two scaled variables and run another prediction on a list like this [.86,.2,.65]. In this way, you have transformed all of the words to a sentiment score, which you can use as a feature.
Hope that helps.
EDIT PER YOUR UPDATE ABOVE
Yes, in this instance you could use the predict_proba, but really if everything is scaled correctly, and you are using 1/0 as your targets for a class you don't need the predict_proba. The idea is to take the prediction from the words and combine it with the other variables. You do not average the predictions, you make a prediction from the predictions! This is called ensemble learning. Train another model with the output of your predictions as the features. Here is a flow of what you need to do.
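A minimal sketch of how the second-stage feature rows could be assembled is shown below. The text-model scores are made-up placeholders for what the first model would output, `fit_min_max` is a hypothetical helper, and the numeric columns reuse the 2 and 45 from the question with two invented extra rows.

```python
def fit_min_max(values):
    # Learn min and max on the training column, return a scaling function.
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    return lambda v: (v - lo) / span


# Placeholder scores in [0, 1] from the words-only model (assumed values).
text_scores = [0.86, 0.10, 0.55]
# Extra numeric features per stay (hypothetical training rows).
length_of_stay = [2, 10, 6]
n_operations = [45, 5, 25]

scale_stay = fit_min_max(length_of_stay)
scale_ops = fit_min_max(n_operations)

# Each row: [text score, scaled length of stay, scaled operation count].
stacked_rows = [
    [score, scale_stay(stay), scale_ops(ops)]
    for score, stay, ops in zip(text_scores, length_of_stay, n_operations)
]
```

A second classifier would then be trained on `stacked_rows` with the original targets. One caveat worth keeping in mind: the first-stage scores are usually best produced for rows the text model was not trained on (for example via cross-validated predictions), otherwise the second model sees over-optimistic scores.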
Thanks for your time and your detailed answer. I think I get it. In short:
Prediction based on words, and for each bag of words of the training set (t1), you pull out a "sentiment"
Create a new array for each training set row with the sentiment and others values->new training set(t2)
Make a prediction based on t2.
Apply previous steps to the test.
One more question though !
What is the "sentiment" value?! For each bag of words, I have a sparse matrix (CountVectorizer + tf-idf). So how do you calculate the sentiment? Do you run each row of the test set against the rest of the test set? And is your sentiment the clf.predict(X) value?
What I am going to ask may sound very similar to the post Sentiment analysis with NLTK python for sentences using sample data or webservice?, but I am already done with parsing and tokenizing the sentences from my text. My questions are:
1. Of the examples I have seen so far, the NLTK movie review example seems the most similar to my problem. But for movie_review the training text is already organised: it has two folders, pos and neg, and the texts are stored there. How can I do that classification for my huge text? Do I read the data manually and store it in two folders? Does that make the corpus? After that, can I work with it just like the movie_review data in the example?
2. If the answer to the above question is yes, is there any tool to speed up that task? For example, I want to work with only the texts which have "Monty Python" in their content, then classify them manually and store them in the pos and neg folders. Does that work?
Please help me
Yes, you need a training corpus to train a classifier. Or you need some other way to detect sentiment.
To create a training corpus, you can classify by hand, you can have others classify it for you (Mechanical Turk is popular for this), or you can do corpus bootstrapping. For sentiment, that could involve creating 2 lists of keywords: positive words and negative words. Using those, you can create an initial training corpus, correct it by hand, then train a classifier. This is an iterative process, and the key thing to remember is "garbage in, garbage out". In other words, if your training corpus is wrong, you can't expect your classifier to be right.
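The keyword-list bootstrapping step might look roughly like this. The two word lists, the sample texts and the `bootstrap_label` helper are all invented for illustration; the "review" bucket holds texts that need a manual decision.

```python
# Hypothetical seed keyword lists.
POSITIVE = {"good", "great", "funny", "brilliant"}
NEGATIVE = {"bad", "boring", "awful", "dull"}


def bootstrap_label(text):
    # Count keyword hits on each side and pick the clear winner, if any.
    tokens = text.lower().split()
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    if pos > neg:
        return "pos"
    if neg > pos:
        return "neg"
    return None  # tie or no keywords: leave for manual review


texts = [
    "Monty Python is brilliant and funny",
    "a boring and dull sketch",
    "good start but awful ending",
    "Monty Python sketch",
]

# Sort texts into the initial corpus; unlabelled ones go to manual review.
corpus = {"pos": [], "neg": [], "review": []}
for text in texts:
    corpus[bootstrap_label(text) or "review"].append(text)
```

The "pos" and "neg" lists play the role of the two folders in the movie_review layout, and they should still be corrected by hand before training.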