My data contains sequences of letters for my classification problem. I can turn these sequences into numeric data using k-mers (3-letter words are formed), join them, and apply CountVectorizer (which counts how many times each word appears in a sequence instance) to get a matrix of numbers.
I split the data using the train_test_split function.
As we know, at training time there should not be any information from the test data. If the CountVectorizer is fitted on the whole dataset, the unique words from the test set would also be known.
So am I correct in saying that the CountVectorizer needs to be fitted on the train data only (so the vocabulary contains unique words from the train data), and that this fitted vectorizer should then be used to transform both the train and test data?
Yes, you are right: you don't want to leak any information from the test data into the training process, so fitting the CountVectorizer on the train data only (unique words from the train data) and using it to transform both the train and test data is the right practice.
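A minimal sketch of this workflow with scikit-learn (the example sequences, the k-mer helper, and k=3 are assumptions for illustration):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical example sequences and labels
sequences = ["ATGCGA", "GGCTAA", "ATGGCT", "TTAGGC"]
labels = [0, 1, 0, 1]

def to_kmer_sentence(seq, k=3):
    # Turn a sequence into a space-joined string of overlapping k-mers
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))

docs = [to_kmer_sentence(s) for s in sequences]
X_train, X_test, y_train, y_test = train_test_split(docs, labels, test_size=0.25, random_state=42)

cv = CountVectorizer()
X_train_vec = cv.fit_transform(X_train)  # vocabulary built from train data only
X_test_vec = cv.transform(X_test)        # k-mers unseen in training are ignored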
I wish to fine-tune a SentenceTransformer model with a multi-class labeled dataset for text classification.
The tutorials I have seen so far need the training data in a specific format, such as a list of positive triplets like (sentence1, sentence2, 1) and a list of negative triplets like (sentence1, sentence3, 0).
A typical classification dataset is not like that. It's a list of (sentence1, class1), (sentence2, class2), (sentence3, class1), (sentence4, class3), etc.
Is there any ready-made logic/code/tutorial that demonstrates how, given a typical classification dataset, to generate the necessary triplet lists by permutations and combinations, and then train a SentenceTransformer successfully, hopefully with better accuracy?
If you have a small number of samples, i.e. for few-shot training, SetFit can be used.
If you have a large number of samples for fine-tuning, there is an unsupervised approach called TSDAE.
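As for generating the triplet lists yourself, here is a minimal sketch (the function and example data are my own, not from any library): it labels pairs of sentences that share a class as positives (1) and pairs from different classes as negatives (0).

from itertools import combinations

def make_pairs(dataset):
    # Build (sentence_a, sentence_b, 1/0) pairs from (sentence, label) tuples.
    # Same label -> 1 (positive), different labels -> 0 (negative).
    # Note: quadratic in dataset size; sample pairs for large datasets.
    return [(s1, s2, 1 if c1 == c2 else 0)
            for (s1, c1), (s2, c2) in combinations(dataset, 2)]

dataset = [("the movie was great", "positive"),
           ("what a waste of time", "negative"),
           ("absolutely loved it", "positive")]
pairs = make_pairs(dataset)
# [('the movie was great', 'what a waste of time', 0),
#  ('the movie was great', 'absolutely loved it', 1),
#  ('what a waste of time', 'absolutely loved it', 0)]

Such pairs can then be wrapped in sentence_transformers InputExample objects and trained with a pairwise loss such as losses.ContrastiveLoss.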
I am a beginner using Keras, and I am trying to preprocess data for training in order to build a neural network. However, I was told that in the CSV file I am getting my data from, the first 6 columns are the x values while the rest are y values. How can I deal with this situation in order to split the data correctly for training and testing? The data is all numerical, not categorical. It will be used to predict movement.
When splitting data into training and testing sets, you aren't splitting along the columns, you're splitting along the rows, so the training and test sets will have identical columns but different rows.
You can use scikit-learn's train_test_split (docs) to do this for you. So to create an 80-20 split you would do:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv(<path to csv>)
train, test = train_test_split(df, test_size=0.20, shuffle=True, random_state=42)
Note that in the docs the example also splits the label column out; however, you don't need to do this if you wish to keep the labels and features together.
The random_state parameter (choose any number you like) just ensures that when you re-run the code, the split will be exactly the same each time, i.e. it is reproducible.
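Once you have the row-wise split, you can separate the first 6 columns as features and the rest as targets. A minimal sketch, assuming the column order in your CSV is as you describe:

# First 6 columns are the features (x), the remaining columns are the targets (y)
X_train, y_train = train.iloc[:, :6], train.iloc[:, 6:]
X_test, y_test = test.iloc[:, :6], test.iloc[:, 6:]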
After doing a lot of research about AI and sentiment analysis, I found 2 ways to do text analysis.
After the pre-processing of the text is done, we must create a classifier in order to get the positive and negative labels, so my question is: is it better to have, for example:
First way:
100 records of text to train that include 2 fields: a text field and a
status field that indicates whether it is positive (1) or negative (0).
Second way:
100 records of text to train, building a vocabulary for a bag of words in order to train and compare the tested records based on this bag of words.
If I am mistaken in my question, please tell me and correct it.
I think you might be missing something here. To train a sentiment analysis model, you need training data in which every row has a label (positive or negative) and a raw text. For the computer to understand or "see" the text, the text must be represented as numbers (a computer cannot work with raw text directly). One way to represent text as numbers is a bag of words (there are other methods, such as TF-IDF, Word2Vec, etc.). So when you train the model on the training data, the program should preprocess the raw text and then build (in this case) a bag-of-words mapping where every element position represents one vocabulary word; the value is 1 or more if the word exists in the text and 0 if it doesn't.
Now suppose the training has finished and the program has produced a model. This model is what you save, so whenever you want to test new data you don't need to retrain. When you test, yes, you use the bag-of-words mapping from the training data; if a word in the test dataset never occurred in the training dataset, it is simply mapped to 0.
In short: when you want to test, you have to use the bag-of-words mapping from the training data.
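A minimal sketch of that idea with scikit-learn (the example texts and labels are made up):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = ["I love this product", "terrible waste of money"]
train_labels = [1, 0]  # 1 = positive, 0 = negative

cv = CountVectorizer()
X_train = cv.fit_transform(train_texts)  # bag-of-words mapping built from train data only
model = MultinomialNB().fit(X_train, train_labels)

# "fantastic" never occurred in training, so it is simply dropped (mapped to 0)
X_test = cv.transform(["I love this fantastic product"])
print(model.predict(X_test))  # [1]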
I have data with 2 important columns, Product Name and Product Category. I want to classify a search term into a category. My approach (in Python using sklearn & DaskML) to create a classifier was:
Clean the Product Name column of stopwords, numbers, etc.
Create a 90%/10% train-test split
Convert text to vectors using OneHotEncoder
Train a classifier (Naive Bayes) on the training data
Test the classifier
I realized that the OneHotEncoder (or any encoder) converts the text to numbers by creating a matrix that takes into account where and how many times a word occurs.
Q1. Do I need to convert the words to vectors before the train-test split or after it?
Q2. When I search for new words (which may not be in the training text), how will I classify them? If I encode the search term, it will be irrelevant to the encoder used for the training data. Can anybody help me with an approach so that I can classify a search term into a category even if the term doesn't exist in the training data?
Q1. Do I need to convert from words to vectors before the train-test split?
Answer: Every algorithm takes some numeric representation of its inputs, so you have to convert the words to vectors; there is no alternative. Fit the vectorizer after the split, on the training data only, and use it to transform both the train and test sets (as discussed above). Apart from OneHotEncoder, there are other approaches like CountVectorizer and TfidfVectorizer, which are recommended over one-hot encoding for text. You can read more about them here.
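Regarding unseen search terms: a vectorizer fitted on the training text simply ignores tokens it has never seen, so a new search term still maps into the same feature space (possibly as an all-zero vector). A minimal sketch (the product names are made up):

from sklearn.feature_extraction.text import TfidfVectorizer

train_names = ["red cotton shirt", "wireless bluetooth speaker", "leather office chair"]

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_names)  # vocabulary from the training data only

# A search term with one known word and one unseen word:
query = tfidf.transform(["cotton hoodie"])  # "hoodie" is ignored, "cotton" is kept
print(query.nnz)  # 1 non-zero feature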
I am learning NLP and noticed that TextBlob classification based on Naive Bayes (TextBlob is built on top of NLTK; https://textblob.readthedocs.io/en/dev/classifiers.html) works fine when the training data is a list of sentences, and does not work at all when the training data consists of individual words (each word with an assigned classification).
Why?
Because you don't have single words in the training data.
Usually the training and evaluation/testing data are supposed to be drawn from an identical distribution. Biases or skews are usually problematic; only in very few cases can you train a model to do one thing and use it to do something else.
In your case, the model likely spreads the weights over the words in the sentence, so when you pick a single word, you only get a small portion of the represented weight.
To get it to work, you should add single-word examples to your training data.
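For example, with TextBlob's NaiveBayesClassifier you can mix sentence-level and single-word examples in one training list (the texts below are made up for illustration):

from textblob.classifiers import NaiveBayesClassifier

train = [
    # Sentence-level examples
    ("I love this library, it is wonderful", "pos"),
    ("This is an awful, horrible experience", "neg"),
    # Single-word examples so individual words also carry weight
    ("wonderful", "pos"),
    ("horrible", "neg"),
]

cl = NaiveBayesClassifier(train)
print(cl.classify("horrible"))  # expected: "neg"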