I have data with 2 important columns, Product Name and Product Category. I wanted to classify a search term into a category. The approach (in Python using Sklearn & DaskML) to create a classifier was:
Clean Product Name column for stopwords, numbers, etc.
Create 90% 10% train-test split
Convert text to vector using OneHotEncoder
Create classifier (Naive Bayes) on the training data
Test the classifier
I realized the OneHotEncoder (or any encoder) converts the text to numbers by creating a matrix that takes into account where and how many times a word occurs.
Q1. Do I need to convert from Word to Vectors before train-test split or after train-test split?
Q2. When I search for new words (which may not be in the text already), how will I classify them? If I encode the search term on its own, the encoding will be unrelated to the encoder used for the training data. Can anybody help me with an approach so that I can classify a search term into a category even if the term doesn't exist in the training data?
Q1. Do I need to convert from Words to Vectors before train-test split?
Answer: Every algorithm takes some numeric representation of its inputs, so you have to convert from words to vectors. There is no alternative to this. Apart from OneHotEncoder, there are other approaches such as CountVectorizer and TfidfVectorizer, which are recommended over one-hot encoding for text.
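As a rough illustration of both questions, here is a minimal sketch, assuming the split is done first and the vectorizer is fitted only on the training part. The product names, categories, and the search term are made up for the example. With CountVectorizer, words that were not seen during fitting are simply ignored when a new search term is transformed, so the term can still be classified from the words that are known.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Made-up product names and categories, for illustration only.
product_names = [
    "wireless optical mouse", "mechanical gaming keyboard",
    "usb gaming mouse", "bluetooth keyboard",
    "cotton t shirt", "slim fit jeans",
    "wool winter jacket", "running shoes",
    "leather office chair", "wooden dining table",
]
categories = ["electronics"] * 4 + ["clothing"] * 4 + ["furniture"] * 2

# Split first, so the test rows never influence the vocabulary.
X_train_text, X_test_text, y_train, y_test = train_test_split(
    product_names, categories, test_size=0.1, random_state=0
)

# Fit the vectorizer on the training text only, then reuse it everywhere.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(X_train_text)
X_test = vectorizer.transform(X_test_text)

clf = MultinomialNB().fit(X_train, y_train)

# A search term with an unseen word: the same fitted vectorizer is used,
# and the unknown word contributes nothing to the vector.
query = vectorizer.transform(["gaming mouse zzzunknown"])
predicted = clf.predict(query)
print(predicted[0])
```

If every word in the search term is unseen, the vector is all zeros and the prediction falls back on the class priors, which is one reason embeddings are sometimes preferred for this use case.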
I have used TF-IDF to extract features from a sentiment-annotated dataset, and I have used the extracted features to train an ML model with the random forest algorithm. Is it possible for me to now input a sentence into the model and have it return what it believes the sentiment is?
I would need to take that sentence and convert it to TF-IDF values for my model to understand it.
Do I need to recalculate the TF-IDF values for the entire dataset in order to get the values for this new sentence?
Does anyone know a way of doing this, preferably in Python?
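One possible way to do this, sketched under the assumption that scikit-learn's TfidfVectorizer and RandomForestClassifier are used: the vectorizer stores its vocabulary and IDF weights when it is fitted, so a new sentence can be transformed with the already-fitted object without recalculating anything for the original dataset. The sentences and labels below are invented for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Invented sentiment-annotated sentences, for illustration only.
sentences = [
    "I love this film",
    "great acting and a great story",
    "what a wonderful day",
    "I hate this film",
    "terrible acting and a boring story",
    "what an awful day",
]
sentiments = ["positive"] * 3 + ["negative"] * 3

# The pipeline fits the TF-IDF vocabulary/IDF weights and the forest together.
model = make_pipeline(
    TfidfVectorizer(),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(sentences, sentiments)

# Predicting reuses the stored vocabulary and IDF weights; nothing is refitted.
new_sentence = ["a wonderful story"]
prediction = model.predict(new_sentence)

# The TF-IDF values of the new sentence can also be inspected directly.
tfidf_values = model.named_steps["tfidfvectorizer"].transform(new_sentence)
print(prediction[0], tfidf_values.shape)
```

To reuse the model later without retraining, the whole fitted pipeline can be saved (for example with joblib) so that the vectorizer and the classifier stay together.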
My data contains sequences of letters for my classification problem. I can turn these sequences into numeric data by forming k-mers (3-letter words), joining them, and applying CountVectorizer (which counts how many times each word appears in a sequence instance), which gives me a matrix of numbers.
I split the data using the train_test_split function.
As we know, at training time there should not be any information from the test data. If the CountVectorizer is fitted on the whole data, the unique words from the test set would also be known.
So am I correct in saying that the CountVectorizer needs to be fitted on the train data (unique words only from the train data), and that this fitted vectorizer should then be used to transform both the train and test data?
Yes, you are right. You don't want to leak any information from the test data into the train data, so fitting the CountVectorizer on the train data only (unique words only from the train data) and using that fitted vectorizer to transform both the train and test data is the right practice.
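The practice described above could look roughly like the following sketch. The sequences, the labels, and the helper name to_kmers are hypothetical and only serve the illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split


def to_kmers(sequence, k=3):
    """Return the overlapping k-mers of a sequence, joined by spaces."""
    return " ".join(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical sequences and labels.
sequences = ["ATGCATGC", "GGGTTTAA", "ATGCGGGT", "TTTAAATG", "CCCGGGAT", "ATATATGC"]
labels = [0, 1, 0, 1, 1, 0]

docs = [to_kmers(s) for s in sequences]
train_docs, test_docs, y_train, y_test = train_test_split(
    docs, labels, test_size=2, random_state=0
)

# Fit on the training documents only: the vocabulary contains train k-mers only.
cv = CountVectorizer(lowercase=False)
X_train = cv.fit_transform(train_docs)

# Transform the test documents with the same fitted vectorizer;
# k-mers that never occurred in the training data are dropped.
X_test = cv.transform(test_docs)
print(X_train.shape, X_test.shape)
```

Wrapping the vectorizer and the classifier in a scikit-learn Pipeline gives the same guarantee automatically during cross-validation, since the vectorizer is then refitted on each training fold.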
Let's say I have a dataset consisting of a review column with exactly 100 words for each review. Then it may be easy to train my model, as I can simply tokenize each of the 100 words of each review, convert it into a numerical array, and feed it into a Sequential model with input_shape=(1,100). But in the real world, reviews are never the same size. If I use a function such as CountVectorizer, then the structure of the sentence is not preserved, and one-hot encoding may not be efficient enough.
So what is the proper way to preprocess this particular dataset so that I can feed it into a trainable NN?
A common way to represent text as vectors is by utilizing word embeddings. The main idea is that you use a large text corpus to compute vector representations of all words occurring in that corpus. So now, for each review, you could run the following algorithm to compute its vector representation:
1. For each word in the review, check if a word embedding exists (in other words, that word occurred in the large training corpus), and if it does, add its vector representation to the representation of the review.
2. Once you have summed up the vector representations of all words, compute the average embedding by dividing the summed review vector by the number of words in the document. This results in the final vector representation for that document.
3. This vector can now be fed into a trainable NN.
Before performing steps 1-3, you could also apply more preprocessing steps: remove filler words such as "and", "or", etc., as they usually carry little meaning, convert words to lower case, and apply other standard NLP (natural language processing) techniques, all of which could affect the vector representation of the reviews. But the key idea is to sum up the word vectors of a review and use its averaged vector as the representation of the review. Because of the averaging, the length of the reviews is unimportant. Similarly, in word embeddings the dimensionality of the word vectors is fixed (100D, 200D, ...), so you can experiment to find the most suitable dimensionality.
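The averaging procedure can be sketched as follows. The embedding table here is a tiny made-up dictionary standing in for real pre-trained vectors, and the function name review_vector and the stop-word set are hypothetical. One assumption to note: this sketch divides by the number of words that actually have an embedding, which is a common variant of dividing by the total word count.

```python
import numpy as np

# Made-up 4-dimensional embeddings standing in for a real pre-trained model.
embeddings = {
    "great": np.array([0.8, 0.1, 0.0, 0.3]),
    "movie": np.array([0.2, 0.7, 0.1, 0.0]),
    "boring": np.array([-0.6, 0.2, 0.4, 0.1]),
}
STOP_WORDS = {"and", "or", "a", "the"}


def review_vector(review, embeddings, dim=4):
    """Average the embeddings of all known, non-stop-word tokens in a review."""
    total = np.zeros(dim)
    count = 0
    for word in review.lower().split():
        if word in STOP_WORDS:
            continue
        if word in embeddings:  # skip words without an embedding
            total += embeddings[word]
            count += 1
    # Assumption: divide by the number of words that had an embedding.
    return total / count if count else total


short_review = review_vector("Great movie", embeddings)
long_review = review_vector("A great and great movie and the unknownword", embeddings)

# Both reviews end up with the same fixed length, regardless of word count.
print(short_review.shape, long_review.shape)
```

Since every review maps to a vector of the same dimensionality, the resulting array can be passed to a dense network with a fixed input size.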
Note that there are many different models available that compute word embeddings, so you could choose any of them. One that is nicely integrated into Python is word2vec.
And a state-of-the-art model that is currently being used by Google is called BERT.
After I did a lot of research about AI and sentiment analysis, I found 2 ways to do text analysis.
After the pre-processing of the text is done, we must create a classifier in order to get the positive and negative labels, so my question is which of the following is better, for example:
first way:
100 records of text to train on, with 2 fields: the text and a status field that indicates whether it is positive (1) or negative (0).
second way:
100 records of text to train on, used to build a vocabulary for a bag of words, so that the tested records are compared based on this bag of words.
If I am mistaken in my question, please tell me and correct my question.
I think you might be missing something here. To train a sentiment analysis model, you need training data in which every row has a label (positive or negative) and a raw text. For the computer to be able to "see" the text, the text has to be represented as numbers (since a computer cannot understand raw text), and one way to do that is a bag of words (there are other methods to represent text, like TF-IDF, word2vec, etc.). So when you train the model on the training data, the program should preprocess the raw text and then build (in this case) a bag-of-words mapping where every element position represents one vocabulary word; the element is 1 or more if the word exists in the text and 0 if it doesn't.
Now suppose the training has finished. The program then produces a model, and this model is what you save, so whenever you want to test on new data, you don't need to re-train. When you want to test, yes, you will use the bag-of-words mapping from the training data. If there is a word in the test dataset that never occurred in the training dataset, it has no position in the mapping, so it is simply ignored and contributes nothing to the vector.
In short:
when you want to test, you have to use the bag-of-words mapping from the training data.
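To make the mapping concrete, here is a small hand-written sketch of a bag of words in plain Python. The function names build_vocabulary and bag_of_words and the example texts are made up; libraries such as scikit-learn provide the same idea ready-made.

```python
from collections import Counter


def build_vocabulary(train_texts):
    """Assign one vector position to every distinct word in the training texts."""
    vocab = {}
    for text in train_texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def bag_of_words(text, vocab):
    """Count known words at their vocabulary positions; unknown words are skipped."""
    counts = Counter(text.lower().split())
    vector = [0] * len(vocab)
    for word, n in counts.items():
        if word in vocab:
            vector[vocab[word]] = n
    return vector


train_texts = ["good good movie", "bad movie"]
vocab = build_vocabulary(train_texts)  # positions: good=0, movie=1, bad=2

train_vectors = [bag_of_words(t, vocab) for t in train_texts]

# "film" never occurred in the training data, so it has no position at all.
test_vector = bag_of_words("good film", vocab)
print(train_vectors, test_vector)
```

The vocabulary is built once from the training texts and saved together with the model; test texts are only ever encoded with it, never used to extend it.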
I've been searching for an answer to this specific question for a few hours and while I've learned a lot, I still haven't figured it out.
I have a dataset of ~70,000 sentences with a subset of about 4,000 sentences that have been appropriately categorized; the rest are uncategorized. Currently I'm using a scikit-learn pipeline with CountVectorizer and TfidfTransformer to vectorize the data. However, I'm only vectorizing based on the 4,000 sentences and then testing various models via cross-validation.
I'm wondering if there is a way to use Word2Vec or something similar to vectorize the entire corpus of data and then use these vectors with my subset of 4,000 sentences. My intention is to increase the accuracy of my model predictions by using word vectors that incorporate all of the semantic data in the corpus rather than just data from the 4,000 sentences.
The code I'm currently using is:
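from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Vectorize with 3- to 5-word n-grams, re-weight with TF-IDF, then classify.
svc = Pipeline([('vect', CountVectorizer(ngram_range=(3, 5))),
                ('tfidf', TfidfTransformer()),
                ('clf', LinearSVC()),
                ])
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)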
Where X_train and y_train are my features and labels, respectively. I also have a list z_all which includes all remaining uncategorized features.
Just getting pointed in the right direction (or told whether or not this is possible) would be super helpful.
Thank you!
I would say that the answer is yes: you can use Word2Vec or another similar word-embedding method to get vectors of each sentence in your data, and then use these vectors both as training and testing data in a linear Support Vector Machine (SVC).
And yes, you can first create those vectors for your entire corpus of ~70,000 sentences before actually doing any training on your data.
It is however not as straightforward as the approach you're currently using.
There are many different ways to do this so I'll just go through one of them to help you get the basics of how this can be done.
Before we start and see what possible steps you can follow, let's remember that the goal here is to get one vector for each and every sentence of your corpus.
If you don't know what word embeddings are, I highly suggest reading about them, but in short they are just a way to link each word of a pre-defined vocabulary to a vector of a given dimension.
For instance, you would have:
# the vector associated with the word "cat" is the following vector of fixed-length
word_embeddings["cat"] = [0.0014, 0.6710, ..., 0.3281]
Now that you know this, here are the steps you could be following:
Tokenization - The first thing that you want to do is to tokenize each of your sentences. This can be done using an NLP library (SpaCy for instance) that will help you to:
split each sentence into a list of words
remove any punctuation from these words and convert them to lowercase
remove stopwords - optionally
lemmatize all the words - optionally
Train a word embedding model - Now that you have each sentence as a pre-processed list of words, you need to train a word-embedding model using your corpus. There are many different algorithms to do that. I would suggest using GenSim with Word2Vec or fastText. You can also use pre-trained word embeddings, like GloVe or anything that best fits your corpus in terms of language/context. Either way, this will allow you to:
have one vector of pre-defined size for each and every word in your corpus' vocabulary
get a list of equally-sized vectors for each sentence in your corpus
Adopting a weighting method - Once you have a list of vectors for each sentence in your corpus, and mainly because your sentences vary in length (some have 6 words, some others have 13 words, etc.), what you want is a single vector for each and every sentence. To get it, you can simply combine the vectors corresponding to the words in each sentence using some weighting. You can:
average all vectors
use weights like TF-IDF weights to give some words more importance than others
use other weighting methods...
Training and testing - Finally, all that is left to do is to train a model using these vectors, for instance a linear Support Vector Machine (SVC), and to test the accuracy of your model on a test dataset (you can also use a validation dataset).
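The weighting and training steps above can be sketched as follows. This is only an illustration under stated assumptions: the word_embeddings dictionary holds made-up 3-dimensional values standing in for vectors from a trained Word2Vec, fastText, or GloVe model, the sentences and labels are invented, and the IDF weights are taken from a TfidfVectorizer fitted on the whole corpus (labeled and unlabeled), which uses no label information.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Made-up embeddings standing in for the output of a trained embedding model.
word_embeddings = {
    "cheap": np.array([0.9, 0.1, 0.0]),
    "phone": np.array([0.8, 0.2, 0.1]),
    "laptop": np.array([0.7, 0.3, 0.0]),
    "fresh": np.array([0.1, 0.9, 0.2]),
    "apple": np.array([0.2, 0.8, 0.1]),
    "bread": np.array([0.0, 0.9, 0.3]),
}

# Whole corpus: the first four sentences are labeled, the rest are not.
all_sentences = ["cheap phone", "cheap laptop", "fresh apple", "fresh bread",
                 "phone laptop", "apple bread"]
labeled_sentences = all_sentences[:4]
labels = ["electronics", "electronics", "food", "food"]

# IDF weights computed over the entire corpus.
tfidf = TfidfVectorizer().fit(all_sentences)
idf = {word: tfidf.idf_[index] for word, index in tfidf.vocabulary_.items()}


def sentence_vector(sentence, dim=3):
    """IDF-weighted average of the embeddings of the known words in a sentence."""
    tokens = [t for t in sentence.lower().split()
              if t in word_embeddings and t in idf]
    if not tokens:
        return np.zeros(dim)
    vectors = np.array([word_embeddings[t] for t in tokens])
    weights = np.array([idf[t] for t in tokens])
    return np.average(vectors, axis=0, weights=weights)


# One fixed-length vector per labeled sentence, then a linear SVM on top.
X_train = np.array([sentence_vector(s) for s in labeled_sentences])
clf = LinearSVC().fit(X_train, labels)

prediction = clf.predict([sentence_vector("phone laptop")])
print(prediction[0])
```

Replacing the weights with all ones gives the plain average; which weighting works better is something to check with cross-validation on the labeled subset.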
My opinion is that if you are going to use a word2vec embedding, you should use a pre-trained one or train it on generic text.
Word2vec embeddings are usually used to give meaning and context to your text data. If you train an embedding using only your own data, it might be biased and not representative of the language, which means its vectors may not carry much meaning.
Once you have your embedding working, you also have to think about what to do with your words, because a sentence has 1 or more words (the embedding works at word level), and you want to feed your models with 1 sentence -> 1 vector, not 1 sentence -> N vectors.
People usually average or multiply those vectors. For example, for the sentence "Hello there" and an embedding of 5 dims:
Hello -> [0, 0, .2, .3, .8]
there -> [.1, .2, 0, 0, .5]
AVG Hello there -> [.05, .1, .1, .15, .65]
This is what you want to use for your models!
So instead of using TF-IDF to generate your sentence vectors, use word2vec like this and you shouldn't have any problems. I have already worked on a text classification project, and we ended up using a self-trained w2v embedding and an ExtraTrees model, with amazing results.