I'm not good at machine learning. Can someone tell me how to do text classification with pseudo-labeling in Python? I never found the right implementation; I have searched everywhere on the internet, but almost gave up because I couldn't find anything :'( I only found implementations for numeric datasets, not for text classification (vectorized text). So I wrote the code below, but I don't know whether it is correct or not. Am I doing something wrong? Please help me, I really need your help.
This is my dataset if you want to try it. I want to classify 'Label' from 'Content'.
My steps are:
Split the data: 0.75 unlabeled, 0.25 labeled (see the split sketch after this list)
From the 0.25 labeled part, split again: 0.75 labeled train, 0.25 labeled test
Build a vectorizer and transform the train, test and unlabeled datasets
Build the first model from the labeled train data, then label the unlabeled dataset with it
Concatenate the labeled train data with the unlabeled predictions that have probability > 0.99 (pseudo-labeled), and build the second model
Remove the pseudo-labeled rows from the unlabeled dataset
Predict the remaining unlabeled data with the second model, then repeat from step 3 until no predicted pseudo-label has probability > 0.99.
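For reference, the splits in steps 1-2 (which produce the X_train, y_train, X_test, y_test and X_unlabeled used in the code below) could look roughly like this; this is just a sketch, and the file name is made up:

from sklearn.model_selection import train_test_split
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical file name; 'Content' and 'Label' are my columns

# Step 1: 25% labeled, 75% unlabeled
X_labeled, X_unlabeled, y_labeled, _ = train_test_split(
    df['Content'], df['Label'], test_size=0.75, random_state=42)

# Step 2: from the labeled 25%, 75% train / 25% test
X_train, X_test, y_train, y_test = train_test_split(
    X_labeled, y_labeled, test_size=0.25, random_state=42)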
This is my code:
Performing pseudo labelling on text classification
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.compose import ColumnTransformer
from sklearn.metrics import f1_score
import pandas as pd

# Initiate iteration counter
iterations = 0
# Containers to hold f1_scores and # of pseudo-labels
train_f1s = []
test_f1s = []
pseudo_labels = []
# Assign value to initiate while loop
high_prob = [1]
# Loop will run until there are no more high-probability pseudo-labels
while len(high_prob) > 0:
    # Set the vector transformer (fitted on the current labeled training data)
    columnTransformer = ColumnTransformer([
        ('tfidf', TfidfVectorizer(stop_words=None, max_features=100000), 'Content')
    ], remainder='drop')

    def transforms(series):
        before_vect = pd.DataFrame({'Content': series})
        vector_transformer = columnTransformer.fit(pd.DataFrame({'Content': X_train}))
        return vector_transformer.transform(before_vect)

    X_train_df = transforms(X_train)
    X_test_df = transforms(X_test)
    X_unlabeled_df = transforms(X_unlabeled)

    # Fit classifier and make train/test predictions
    nb = MultinomialNB()
    nb.fit(X_train_df, y_train)
    y_hat_train = nb.predict(X_train_df)
    y_hat_test = nb.predict(X_test_df)

    # Calculate and print iteration # and f1 scores, and store f1 scores
    train_f1 = f1_score(y_train, y_hat_train)
    test_f1 = f1_score(y_test, y_hat_test)
    print(f"Iteration {iterations}")
    print(f"Train f1: {train_f1}")
    print(f"Test f1: {test_f1}")
    train_f1s.append(train_f1)
    test_f1s.append(test_f1)

    # Generate predictions and probabilities for unlabeled data
    print("Now predicting labels for unlabeled data...")
    pred_probs = nb.predict_proba(X_unlabeled_df)
    preds = nb.predict(X_unlabeled_df)
    prob_0 = pred_probs[:, 0]
    prob_1 = pred_probs[:, 1]

    # Store predictions and probabilities in dataframe
    df_pred_prob = pd.DataFrame([])
    df_pred_prob['preds'] = preds
    df_pred_prob['prob_0'] = prob_0
    df_pred_prob['prob_1'] = prob_1
    df_pred_prob.index = X_unlabeled.index

    # Separate predictions with > 99% probability
    high_prob = pd.concat([df_pred_prob.loc[df_pred_prob['prob_0'] > 0.99],
                           df_pred_prob.loc[df_pred_prob['prob_1'] > 0.99]],
                          axis=0)
    print(f"{len(high_prob)} high-probability predictions added to training data.")
    pseudo_labels.append(len(high_prob))

    # Add pseudo-labeled data to training data
    X_train = pd.concat([X_train, X_unlabeled.loc[high_prob.index]], axis=0)
    y_train = pd.concat([y_train, high_prob.preds])

    # Drop pseudo-labeled instances from unlabeled data
    X_unlabeled = X_unlabeled.drop(index=high_prob.index)
    print(f"{len(X_unlabeled)} unlabeled instances remaining.\n")

    # Update iteration counter
    iterations += 1
I think I'm doing something wrong, because the f1 scores keep decreasing. Please help me guys :'( I'm stressed.
[image: f1 scores per iteration]
=================EDIT=================
So I've searched journals, and I think I had misunderstood the concept of data splitting in pseudo-labelling.
I initially thought that the steps start by splitting the data into labeled and unlabeled sets, and that the labeled set is then split into train and test.
But after more searching I found in this journal that my steps were incorrect. The journal says that pseudo-labeling should start by splitting the data into train and test sets, and that the train set is then split into labeled and unlabeled datasets.
According to that journal, the best result is reached when splitting the data into 90% train and 10% test sets. Then, the 90% train set is split into 20% labeled and 80% unlabeled data. The journal tries confidence thresholds from 0.7 to 0.9 as the boundary for accepting pseudo-labels, and with that split proportion the best threshold value is 0.74. So I fixed my steps with the new proportions and the 0.74 threshold, and my F1 scores finally increase. Here are my steps:
Split the data: 0.9 train, 0.1 test sets (I keep the labels of the test set, so I can measure the f1 scores)
From the 0.9 train part, split: 0.2 labeled and 0.8 unlabeled data (see the sketch after this list)
Build a vectorizer for the X values of the labeled train, test and unlabeled training datasets
Build the first model from the labeled train data, then label the unlabeled training data. Then measure the F1 score against the (already labeled) test set.
Concatenate the labeled train data with the unlabeled predictions that have probability > 0.74 (threshold based on the journal). We call this new data pseudo-labelled and treat it like the actual labels, then build the second model from the new train set.
Remove the selected pseudo-labelled rows from the unlabeled dataset
Use the second model to predict the remaining unlabeled data, then repeat from step 3 until no predicted pseudo-label has probability > 0.74
The last model is the final one.
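For reference, the corrected splits in steps 1-2 could look roughly like this (again just a sketch; df, 'Content' and 'Label' refer to my dataset):

from sklearn.model_selection import train_test_split

# Step 1: 90% train, 10% test (the test set keeps its labels for evaluation)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    df['Content'], df['Label'], test_size=0.10, random_state=42)

# Step 2: from the 90% train part, 20% labeled / 80% unlabeled
X_train, X_unlabeled, y_train, _ = train_test_split(
    X_train_full, y_train_full, test_size=0.80, random_state=42)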
My code is still the same; I only changed the split proportions, and my f1 scores finally increase through 4 iterations: my new f1 scores.
Am I doing it right now? Thank you all for your attention.. So much thank you..
I'm not good at machine learning.
Overall I would say that you are quite good at machine learning: semi-supervised learning is an advanced type of problem, and I think your solution is quite good. At least the general principle seems correct, but it's difficult to say for sure (I don't have time to analyze the code in detail, sorry). A few comments:
One thing which might be improvable is the 0.74 threshold: this value certainly depends on the data, so you could do your own experiment by trying different threshold values and selecting the one which works best with your data.
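For example, a rough sketch of such a sweep; run_pseudo_labeling is a hypothetical wrapper you would implement around your whole loop above, taking the threshold as a parameter and returning the final test F1:

import numpy as np

def run_pseudo_labeling(threshold):
    # Placeholder: wrap your while-loop here, using `threshold` instead of 0.74/0.99,
    # and return the final test F1 score. A dummy value is returned so the sketch runs.
    return 0.0

thresholds = np.arange(0.70, 1.00, 0.05)
results = {round(t, 2): run_pseudo_labeling(t) for t in thresholds}
best_threshold = max(results, key=results.get)
print(f"Best threshold: {best_threshold} (test F1 = {results[best_threshold]})")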
Preferably, keep a final test set aside and use a separate validation set during the iterations. This would avoid the risk of data leakage.
I'm not sure about the stop condition for the loop. It might be OK, but it might be worth trying other options (see the sketch after this list):
Simply iterate a fixed number of times (for instance 10 times).
The stop condition could be based on "no more F1-score improvement" (i.e. stabilization of the performance), but it's a bit more advanced.
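A minimal sketch of that second option, assuming test_f1s is the list of per-iteration F1 scores your loop already collects (the patience and iteration limits are arbitrary):

max_iterations = 10
patience = 2

def should_stop(test_f1s, iterations):
    if iterations >= max_iterations:
        return True  # hard cap on the number of iterations
    if len(test_f1s) > patience and max(test_f1s[-patience:]) <= max(test_f1s[:-patience]):
        return True  # no improvement over the last `patience` iterations
    return False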
It's pretty good anyway; my comments are just ideas if you want to improve further. Note that it's been a long time since I've worked with semi-supervised learning, so I'm not sure I remember everything very well ;)
I am building a quant model that takes a bunch of features and predicts the performance of an index. The model is doing exceptionally well, which obviously makes me wonder if I am making some mistake.
I have looked at the underlying features that I am using to ensure there is no data leakage. So now my attention is turning towards my code. Below is the main body of code that I use for prediction.
Does anything look wrong in the looping or how I am predicting? Please let me know if you need any more information and I will share what I can share.
X -> Features used in model training and prediction
y -> Class variable (1,0)
n_record -> Number of records in the dataset
n_train -> Amount of data to use for training in the rolling window construct
model -> Ensemble model from sklearn
My training data is ~4500 records. I used an n_train of 800 to train the first instance of the model, and then a rolling window of 800 records for training to predict the 801st instance (and so on). In that way I roll through time, leaving out very old data (keeping the model "current").
col_names = ['Pred', 'Actual', 'Pred Prob']  # Column names for prediction output dataframe

def Strategy(n_train):
    list_ans = []
    n_records = len(X)  # Number of records in X
    for i in range(n_train, n_records):
        # Rolling window: train on the previous n_train records and predict tomorrow's performance
        X_train, X_test, y_train, y_test = X[i-n_train:i], X[i:i+1], y[i-n_train:i], y[i:i+1]
        X_train = ss.fit_transform(X_train)
        X_test = ss.transform(X_test)
        model.fit(X_train, y_train)
        Pred = model.predict(X_test)
        Actual = y_test.values
        Prob = model.predict_proba(X_test)[:, 1]
        i_ans = [Pred.item(), Actual.item(), Prob.item()]
        resi = pd.Series(data=i_ans, index=col_names)
        list_ans.append(resi)
    return pd.DataFrame(list_ans)
For 1: what values do you expect for n_records and n_train? Keep in mind that n_train acts as the minimum value for the range. I don't know if this is how it should be, but be careful, you may be skipping training data.
Apart from that, it looks good to me!
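To make the range point concrete, here is a tiny, purely illustrative sketch of which rows each iteration uses (the numbers are made up; your real values are ~4500 and 800):

n_records = 10  # illustrative only
n_train = 4     # illustrative only

for i in range(n_train, n_records):
    train_rows = list(range(i - n_train, i))
    test_row = i
    print(f"iteration {i - n_train}: train on rows {train_rows}, predict row {test_row}")
# Rows 0..n_train-1 are only ever used for training; they never receive a prediction.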
There is a deep-learning-based model using transfer learning and LSTM in this article; the author used 10-fold cross validation (as explained in Table 3) and took the average of the results.
I am familiar with 10-fold cross validation in the sense that we need to divide the data and pass it to the model; however, in this code (here) I can't figure out how to partition the data and pass it in.
There are two train/test/dev datasets (one for emotion analysis and one for sentiment analysis; we use both for transfer learning, but my focus is on emotion analysis). The raw data is in a couple of txt files, and after running the model it produces two new txt files, one with the predicted labels and one with the true labels.
There is this block of code in the main file:
model = BiLstm(args, data, ckpt_path='./' + args.data_name + '_output/')
if args.mode == 'train':
    model.train(data)
    sess = model.restore_last_session()
    model.predict(data, sess)
if args.mode == 'test':
    sess = model.restore_last_session()
    model.predict(data, sess)
in which 'data' is an instance of the class Data (code) that includes the test/train/dev datasets:
This is where I think I need to pass the divided data. If I am right, how can I do the partitioning and perform 10-fold cross validation?
data = Data('./data/' + args.data_name + 'data_sample.bin',
            './data/' + args.data_name + 'vocab_sample.bin',
            './data/' + args.data_name + 'word_embed_weight_sample.bin',
            args.batch_size)
class Data(object):
    def __init__(self, data_path, vocab_path, pretrained, batch_size):
        self.batch_size = batch_size
        data, vocab, pretrained = self.load_vocab_data(data_path, vocab_path, pretrained)
        self.train = data['train']
        self.valid = data['valid']
        self.test = data['test']
        self.train2 = data['train2']
        self.valid2 = data['valid2']
        self.test2 = data['test2']
        self.word_size = len(vocab['word2id']) + 1
        self.max_sent_len = vocab['max_sent_len']
        self.max_topic_len = vocab['max_topic_len']
        self.word2id = vocab['word2id']
        word2id = vocab['word2id']
        # self.id2word = dict((v, k) for k, v in word2id.iteritems())
        self.id2word = {}
        for k, v in six.iteritems(word2id):
            self.id2word[v] = k
        self.pretrained = pretrained
By the look of it, it seems the train method can take a session and continue training from an existing model: def train(self, data, sess=None).
So with very minimal changes to the existing code and libraries you can do something like the following.
First load all the data and build the model:
data = Data('./data/' + args.data_name + 'data_sample.bin',
            './data/' + args.data_name + 'vocab_sample.bin',
            './data/' + args.data_name + 'word_embed_weight_sample.bin',
            args.batch_size)
model = BiLstm(args, data, ckpt_path='./' + args.data_name + '_output/')
Then create the cross-validation datasets, something like:
def get_new_data_object():
    return Data('./data/' + args.data_name + 'data_sample.bin',
                './data/' + args.data_name + 'vocab_sample.bin',
                './data/' + args.data_name + 'word_embed_weight_sample.bin',
                args.batch_size)

cross_validation = []
for i in range(10):
    tmp_data = get_new_data_object()
    tmp_data.train = ...   # take the 90% of the train split that belongs to fold i
    tmp_data.valid = ...   # take the 90% of the valid split that belongs to fold i
    tmp_data.test = ...    # take the 90% of the test split that belongs to fold i
    tmp_data.train2 = ...  # take the 90% of the train2 split that belongs to fold i
    tmp_data.valid2 = ...  # take the 90% of the valid2 split that belongs to fold i
    tmp_data.test2 = ...   # take the 90% of the test2 split that belongs to fold i
    cross_validation.append(tmp_data)
Then run the model n times (10 for 10-fold cross validation):
sess = None
for data in cross_validation:
    model.train(data, sess)
    sess = model.restore_last_session()
Keep in mind to pay attention to some key ideas:
I don't know how your data is structured exactly, but that affects the way of splitting it into test, train and (in your case) valid.
The splitting has to be exactly the same split for each triple of test, train and valid; it can be done randomly or by taking a different part every time, as long as it is consistent.
You can train the model n times with cross validation, or create n models and pick the best, to avoid overfitting.
This code is just a draft; you can implement it however you like. There are some great libraries that already implement such functionality (see the sketch below), and of course it can be optimized (e.g. not reading the whole data files each time).
One more consideration is to separate the model creation from the data, especially the data argument of the model constructor; from a quick look it seems it only uses the dimensions of the data, so it is good practice not to pass the whole object.
Moreover, if the model integrates other properties of the data object into its state when it is created (like the data itself), my code might not work and a more surgical approach would be needed.
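As an illustration of the library route, scikit-learn's KFold can generate the per-fold indices for you. Here is a minimal, self-contained sketch on a dummy list of examples; how you apply the indices to your train/valid/test attributes depends on how they are actually stored:

from sklearn.model_selection import KFold

examples = list(range(20))  # stand-in for one of your splits, e.g. data.train

kf = KFold(n_splits=10, shuffle=True, random_state=42)
for fold, (train_idx, held_out_idx) in enumerate(kf.split(examples)):
    fold_train = [examples[i] for i in train_idx]        # ~90% of the split
    fold_held_out = [examples[i] for i in held_out_idx]  # ~10% held out for this fold
    print(f"fold {fold}: {len(fold_train)} train, {len(fold_held_out)} held out")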
Hope it helps and points you in the right direction.
I am building a sentiment analysis model using NLTK and scikit-learn. I have decided to test a few different classifiers in order to see which is most accurate, and eventually use all of them as a means of producing a confidence score.
The datasets used for this testing were all reviews, labelled as either positive or negative.
I trained each classifier with 5,000 reviews, 5 separate times, with 6 different (but very similar) datasets. Each test was done with a new set of 5000 reviews.
I averaged the accuracy for each test and dataset, to arrive at an overall mean accuracy. Take a look:
Multinomial Naive Bayes: 91.291%
Logistic Regression: 96.103%
SVC: 95.844%
In some tests, the accuracy was as high as 99.912%. In fact, the lowest mean accuracy for one of the datasets was 81.524%.
Here's a relevant code snippet:
def get_features(comment, word_features):
    features = {}
    for word in word_features:
        features[word] = (word in set(comment))
    return features

def main(dataset_name, column, limit):
    data = get_data(column, limit)
    data = clean_data(data)  # filter stop words
    all_words = [w.lower() for (comment, category) in data for w in comment]
    word_features = nltk.FreqDist(all_words).keys()
    feature_set = [(get_features(comment, word_features), category)
                   for (comment, category) in data]
    run = 0
    while run < 5:
        random.shuffle(feature_set)
        training_set = feature_set[:int(len(data) / 2.)]
        testing_set = feature_set[int(len(data) / 2.):]
        classifier = SklearnClassifier(SVC())
        classifier.train(training_set)
        acc = nltk.classify.accuracy(classifier, testing_set) * 100.
        save_acc(acc)  # function to save results as .csv
        run += 1
Although I know that these kinds of classifiers can typically return great results, this seems a little too good to be true.
What are some things that I need to check to be sure this is valid?
It's not so good if you get a range from 99.66% down to 81.5%.
To analyze the dataset in a text classification setting, you can check the following (a small sketch for the first two checks is at the end of this answer):
Is the dataset balanced?
The distribution of words for each label; sometimes the vocabulary used for each label can be really different.
Positive/negative, but from the same source? As in the previous point, if the domain is not the same, the reviews may use different expressions for a positive or negative review. This helps explain a high accuracy across several sources.
Try with reviews from a different source.
If after all that you still get such a high accuracy, congrats! Your get_features is really good. :)
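A minimal sketch for the balance and vocabulary checks; `data` mirrors the structure in the code above, a list of (tokenized_comment, category) pairs, and this tiny example is made up:

from collections import Counter

data = [(["great", "movie"], "pos"), (["terrible", "plot"], "neg"),
        (["great", "acting"], "pos"), (["boring"], "neg")]

# 1) Class balance
print(Counter(category for _, category in data))

# 2) Vocabulary per label, and how much the two vocabularies overlap
vocab = {}
for comment, category in data:
    vocab.setdefault(category, set()).update(w.lower() for w in comment)
overlap = vocab["pos"] & vocab["neg"]
print({label: len(words) for label, words in vocab.items()}, "shared:", len(overlap))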
I am new to Machine Learning and to Python. I am trying to build a Random Forest Regression model on one of the datasets from the UCI repository. This is my first ML model. I may be entirely wrong in my approach.
The dataset is available here - https://archive.ics.uci.edu/ml/datasets/abalone
Below is the entire working code that I have written. I am using Python 3.6.4 with Windows 7 x64 OS (forgive me for the lengthy code).
import tkinter as tk # Required for enabling GUI options
from tkinter import messagebox # Required for pop-up window
from tkinter import filedialog # Required for getting full path of file
import pandas as pd # Required for data handling
from sklearn.model_selection import train_test_split # Required for splitting data into training and test set
from sklearn.ensemble import RandomForestRegressor # Required to build random forest
#------------------------------------------------------------------------------------------------------------------------#
# Create an instance of tkinter and hide the window
root = tk.Tk() # Create an instance of tkinter
root.withdraw() # Hides root window
#root.lift() # Required for pop-up window management
root.attributes("-topmost", True) # To make pop-up window stay on top of all other windows
#------------------------------------------------------------------------------------------------------------------------#
# This block of code reads input file using tkinter GUI options
print("Reading input file...")
# Pop up window to ask user the input file
File_Checker = messagebox.askokcancel("Random Forest Regression Prompt",
                                      "At The Prompt, Enter 'Abalone_Data.csv' File.")
# Kill the execution if user selects "Cancel" in the above pop-up window
if (File_Checker == False):
    quit()
else:
    del(File_Checker)

file_loop = 0
while (file_loop == 0):
    # Get path of base file
    file_path = filedialog.askopenfilename(initialdir = "/",
                                           title = "File Selection Prompt",
                                           filetypes = (("CSV Files", "*.*"), ))
    # Condition to check if user selected a file or not
    if (len(file_path) < 1):
        # Pop-up window to warn user that no file was selected
        result = messagebox.askretrycancel("File Selection Prompt Error",
                                           "No file has been selected. \nWhat do you want to do?")
        # Condition to repeat the loop or quit program execution
        if (result == True):
            continue
        else:
            quit()
    # Get file name
    file_name = file_path.split("/")  # Splits the file path with "/" as the delimiter and returns a list
    file_name = file_name[-1]  # Extracts the last element of the list
    # Condition to check if the correct file was selected or not
    if (file_name != "Abalone_Data.csv"):
        result = messagebox.askretrycancel("File Selection Prompt Error",
                                           "Incorrect file selected. \nWhat do you want to do?")
        # Condition to repeat the loop or quit program execution
        if (result == True):
            continue
        else:
            quit()
    # Read the base file
    input_file = pd.read_csv(file_path,
                             sep = ',',
                             encoding = 'utf-8',
                             low_memory = True)
    break

# Delete unwanted variables
del(file_loop, file_name)
#------------------------------------------------------------------------------------------------------------------------#
print("Preparing dependent and independent variables...")
# Create Separate dataframe consisting of only dependent variable
y = pd.DataFrame(input_file['Rings'])
# Create Separate dataframe consisting of only independent variable
X = input_file.drop(columns = ['Rings'], inplace = False, axis = 1)
#------------------------------------------------------------------------------------------------------------------------#
print("Handling Dummy Variable Trap...")
# Create a new dataframe to handle categorical data
# This method splits the dategorical data column into separate columns
# This is to ensure we get rid of the dummy variable trap
dummy_Sex = pd.get_dummies(X['Sex'], prefix = 'Sex', prefix_sep = '_', drop_first = True)
# Remove the speciic columns from the dataframe
# These are the categorical data columns which split into separae columns in the previous step
X.drop(columns = ['Sex'], inplace = True, axis = 1)
# Merge the new columns to the original dataframe
X = pd.concat([X, dummy_sex], axis = 1)
#------------------------------------------------------------------------------------------------------------------------#
y = y.values
X = X.values
#------------------------------------------------------------------------------------------------------------------------#
print("Splitting datasets to training and test sets...")
# Splitting the data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
#------------------------------------------------------------------------------------------------------------------------#
print("Fitting Random Forest regression on training set")
# Fitting the regression model to the dataset
regressor = RandomForestRegressor(n_estimators = 100, random_state = 50)
regressor.fit(X_train, y_train.ravel()) # Using ravel() to avoid getting 'DataConversionWarning' warning message
#------------------------------------------------------------------------------------------------------------------------#
print("Predicting Values")
# Predicting a new result with regression
y_pred = regressor.predict(X_test)
# Enter values for a new prediction as a dictionary
test_values = {'Sex_I' : 0,
               'Sex_M' : 0,
               'Length' : 0.5,
               'Diameter' : 0.35,
               'Height' : 0.8,
               'Whole_Weight' : 0.223,
               'Shucked_Weight' : 0.09,
               'Viscera_Weight' : 0.05,
               'Shell_Weight' : 0.07}
# Convert dictionary into dataframe
test_values = pd.DataFrame(test_values, index = [0])
# Rearranging columns to match the order of the training features
test_values = test_values[['Length', 'Diameter', 'Height', 'Whole_Weight', 'Shucked_Weight',
                           'Viscera_Weight', 'Shell_Weight', 'Sex_I', 'Sex_M']]
# Applying feature scaling
#test_values = sc_X.transform(test_values)
# Predicting values of new data
new_pred = regressor.predict(test_values)
#------------------------------------------------------------------------------------------------------------------------#
"""
print("Building Confusion Matrix...")
# Making the confusion matrix
cm = confusion_matrix(y_test, y_pred)
"""
#------------------------------------------------------------------------------------------------------------------------#
print("\n")
print("Getting Model Accuracy...")
# Get regression details
#print("Estimated Coefficient = ", regressor.coef_)
#print("Estimated Intercept = ", regressor.intercept_)
print("Training Accuracy = ", regressor.score(X_train, y_train))
print("Test Accuracy = ", regressor.score(X_test, y_test))
print("\n")
print("Printing predicted result...")
print("Result_of_Treatment = ", new_pred)
When I look at the model accuracy, below is what I get.
Getting Model Accuracy...
Training Accuracy = 0.9359702279804791
Test Accuracy = 0.5695080680053354
Below are my questions.
1) Why are the Training Accuracy and Test Accuracy so far apart?
2) How do I know if this model is being over/under fitted?
3) Is Random Forest Regression the right model to use? If not, how do I determine the right model for this use case?
4) How can I build a confusion matrix using the variables I have created?
5) How do I validate the performance of the model?
I am looking for your guidance so that I too can learn from my mistakes and improve on my modelling skills.
Before trying to answer your points, a comment: I see you are using a regressor with accuracy as the metric. But accuracy is a metric used in classification problems; in regression models you usually use other metrics, such as Mean Squared Error (MSE). See here.
If you just switch to a more appropriate metric, maybe you will find that your model is not so bad.
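For example, a minimal sketch of computing MSE with scikit-learn, reusing the y_test and y_pred from your code above:

from sklearn.metrics import mean_squared_error
import numpy as np

mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)  # root mean squared error, in the same unit as 'Rings'
print("Test MSE  = ", mse)
print("Test RMSE = ", rmse)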
I'm going to reply to your questions anyway.
Why are the Training Accuracy and Test Accuracy so far away?
This means that you overfitted your training samples: your model is very strong at predicting the data in the training dataset, but unable to generalise. It is like having a model trained on a set of cat pictures which believes only those pictures are cats, and that all the other pictures of all the other cats are not. In fact, your score on the test set (~0.57) is far below your training score (~0.94).
How do I know if this model is being over/under fitted?
Exactly from the difference in score between the two sets. The closer they are to each other, the better the model is able to generalise. You already know what overfitting looks like. Underfitting is generally recognisable by a low score on both sets.
Is Random Forest Regression the right model to use? If not, how do I determine the right model for this use case?
There is no single right model to use. Random Forest, and in general all tree-based models (LightGBM, XGBoost), are the Swiss army knife of machine learning when you are dealing with structured data, because of their simplicity and reliability. Models based on deep learning perform better in theory, but are much more complex to set up.
How can I build a confusion matrix using the variables I have created?
Confusion matrices can be created when you build a classification model, and they are built on the output of your model. Since you are using a regressor, it does not make much sense.
How do I validate the performance of the model?
In general, for a reliable validation of performance you split the data in three: you train on one part (the training set), tune the model on a second (the validation set; this is what you call the test set), and finally, when you are happy with the model and its hyper-parameters, you test it on the third (the actual test set, not to be confused with the one you currently call the test set). This last one tells you whether your model generalizes well or not. The reason is that when you choose and tune the model you can also overfit the validation set (the one you call the test set), for example by selecting a set of hyper-parameters which performs well only on that set.
Also, you have to choose a reliable metric, and this depends both on the data and on the model. With regression, MSE is pretty good.
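A minimal sketch of such a three-way split with scikit-learn; the 60/20/20 proportions are just an example, and X and y are the arrays from your code:

from sklearn.model_selection import train_test_split

# First carve out the final test set (20%), then split the rest into train/validation
X_tmp, X_test_final, y_tmp, y_test_final = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)
# 0.25 of the remaining 80% = 20% of the full data, giving a 60/20/20 split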
With trees and ensembles, you have to be careful with some settings. In your case, the difference comes from "overfitting": your model has learned your training data "too well" and is not able to generalise to other data.
One important thing to do is to limit the depth of the trees. Every tree has a branching factor of 2, which means that at depth d you get 2^d branches.
Let's imagine you have 1000 training values. If you don't limit the depth (and/or min_samples_leaf), you can learn your complete dataset with a depth of 10 (because 2^10 = 1024 > N_training).
What you can do is compare training and test scores for a range of depths (from, let's say, 3 to log2(n)). If the depth is too low, both scores will be low, because you need more branches to learn the data properly; as the depth grows, the test score rises to a peak and then goes back down, while the training score keeps rising. It should look like the classic model-complexity curve, with model complexity here being your depth.
You can also play with min_samples_split and/or min_samples_leaf, which let you split a node only if it contains enough samples. As a result, this also limits the depth and allows trees with a different depth per branch. As explained previously, you can play with these values to look for the best ones, with a grid search as sketched below.
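A minimal sketch of such a grid search with scikit-learn, reusing X_train and y_train from the question (the parameter values are just examples):

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {
    'max_depth': [3, 5, 7, 10, None],
    'min_samples_leaf': [1, 5, 10, 20],
}
search = GridSearchCV(RandomForestRegressor(n_estimators=100, random_state=50),
                      param_grid,
                      scoring='neg_mean_squared_error',
                      cv=5)
search.fit(X_train, y_train.ravel())
print("Best parameters:", search.best_params_)
print("Best CV MSE:", -search.best_score_)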
I hope it helps,
I am using scikit-learn for a binary classification task, and I have:
Class 0: with 200 observations
Class 1: with 50 observations
Because I have unbalanced data, I want to take a random subsample of the majority class so that its number of observations matches the minority class, and use the newly obtained dataset as input to the classifier. The process of subsampling and classifying can be repeated many times. I have the following code for the subsampling, mainly thanks to Ami Tavory:
docs_train=load_files(rootdir,categories=categories, encoding='latin-1')
X_train = docs_train.data
y_train = docs_train.target
majority_x,majority_y=x[y==0,:],y[y==0] # assuming that class 0 is the majority class
minority_x,minority_y=x[y==1,:],y[y==1]
inds=np.random.choice(range(majority_x.shape[0]),50)
majority_x=majority_x[inds,:]
majority_y=majority_y[inds]
It works like a charm. However, after obtaining majority_x and majority_y, I want to replace the old set that represents class 0 in X_train, y_train with the new, smaller set, in order to pass it as follows to the classifier or the pipeline:
pipeline = Pipeline([
    ('vectorizer', CountVectorizer(tokenizer=tokens, binary=True)),
    ('classifier', SVC(C=1, kernel='linear'))])
pipeline.fit(X_train, y_train)
What I have done in order to solve this:
Since the resulting arrays were numpy arrays, and because I am new to this whole area and really trying hard to learn, I tried to combine the two resulting arrays with majority_x + minority_x in order to form the training data I want. I couldn't; it gave some errors which I am still trying to solve. But even if I could, how can I keep the indices aligned so that majority_y and minority_y stay correct as well?
After processing the majority and minority sets you can merge your training data with:
X_train = np.concatenate((majority_x,minority_x))
y_train = np.concatenate((majority_y,minority_y))
Now X_train and y_train will first contain the chosen samples with y=0 and then the samples with y=1.
An idea for your related question:
Choose the majority samples by creating a random permutation vector whose length equals the number of your majority samples.
Then choose the first 50 indices of that vector, then the next 50, and so on.
When you are through with that vector, each sample will have been chosen exactly once.
If you want more iterations, or the remaining part of the permutation vector is too short, you can fall back to random choice.
As I mentioned in my comment, you might also want to add the parameter "replace=False" in your np.random.choice,
if you want to prevent having the same sample multiple times in one iteration.
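A minimal sketch of the permutation idea above, with made-up dummy arrays sized to match the question (200 majority vs 50 minority samples):

import numpy as np

n_majority, n_minority = 200, 50
majority_x = np.arange(n_majority * 3).reshape(n_majority, 3).astype(float)  # dummy features
majority_y = np.zeros(n_majority)

perm = np.random.permutation(n_majority)  # each index appears exactly once

for it, start in enumerate(range(0, n_majority, n_minority)):
    inds = perm[start:start + n_minority]  # next chunk of 50 unique indices
    sub_x = majority_x[inds, :]
    sub_y = majority_y[inds]
    # concatenate sub_x/sub_y with the minority samples and fit the pipeline here
    print(f"iteration {it}: {len(inds)} majority rows, first few indices {inds[:5]}")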