Subsampling + classifying using scikit-learn

Subsampling + classifying using scikit-learn - python

I am using Scikit-learn for a binary classification task.. and I have:
Class 0: with 200 observations
Class 1: with 50 observations
And because I have an unbalanced data.. I want to take a random subsample of the majority class where the number of observations will be the same as the minority class and want to use the new obtained dataset as an input to the classifier .. the process of subsampling and classifying can be repeated many times .. I've the following code for the subsampling with mainly the help of Ami Tavory
docs_train=load_files(rootdir,categories=categories, encoding='latin-1')
X_train = docs_train.data
y_train = docs_train.target
majority_x,majority_y=x[y==0,:],y[y==0] # assuming that class 0 is the majority class
minority_x,minority_y=x[y==1,:],y[y==1]
inds=np.random.choice(range(majority_x.shape[0]),50)
majority_x=majority_x[inds,:]
majority_y=majority_y[inds]
It works like a charm, however, at the end of processing the majority_x and majority_y I want to be able to replace the old set that represent class0 in X_train, y_train with the new smaller set in order to pass it as follow to the classifier or the pipeline:
pipeline = Pipeline([
('vectorizer', CountVectorizer( tokenizer=tokens, binary=True)),
('classifier',SVC(C=1,kernel='linear')) ])
pipeline.fit(X_train, y_train)
What I have done In order to solve this:
since the resulted arrays where numpy arrays, and because I am new to the whole area and I am really trying very hard to learn .. I've tried to combine the two resulted arrays together majority_x+minority_x in order to form the training data that I want .. I couldn't it gave some errors which I am trying to solve until now ... but even if I could .. how can I keep their index so the majority_y and minority_y will be true as well !

After processing majority_x and minority_y you can merge your training sets with
X_train = np.concatenate((majority_x,minority_x))
y_train = np.concatenate((majority_y,minority_y))
Now X_train and y_train will first contain the chosen samples with y=0 and then the samples with y=1.
An idea for your related question:
Make your choice of the majority samples by creating a random permutation vector of the length of the number of your majority samples.
Then choose the first 50 indices of that vector, then the next 50 and so on.
When you are through with that vector, each sample will have been chosen exactly once.
If you want more iterations or the remaining permutation vector is too short, you can resort back to random choice.
As I mentioned in my comment, you might also want to add the parameter "replace=False" in your np.random.choice,
if you want to prevent having the same sample multiple times in one iteration.

Related

Pseudo Labelling on Text Classification Python

I'm not good at machine learning. Can someone tell me how to doing text classification with pseudo labeling in python? I never know the right implementation, I have searched everywhere in internet, but I give up as found anything :'( I just found the implementation for numeric datasets, but I found no implementation for text classification (vectorized text).. So I wrote this syntax, but I don't know whether my code is correct or not. Am I doing wrong? Please help me guys, I really need your help.. :'(
This is my datasets if you wanna try. I want to classify 'Label' from 'Content'
My steps are:
Split data 0.75 unlabeled, 0.25 labeled
From 0.25 labeld I split: 0.75 train labeled, and 0.25 test labeled
Make vectorizer for train, test and unlabeled datasets
Build first model from train labeled, then labelling the unlabeled datasets
Concatting train labeled data with prediction of unlabeled that have >0.99 (pseudolabeled), and make the second model
Remove pseudolabeled from unabeled datasets
Predict the remaining unlabeled from second model, then iterate step 3 until the probability of predicted pseudolabeled <0.99.
This is my code:
Performing pseudo labelling on text classification
from sklearn.naive_bayes import MultinomialNB
# Initiate iteration counter
iterations = 0
# Containers to hold f1_scores and # of pseudo-labels
train_f1s = []
test_f1s = []
pseudo_labels = []
# Assign value to initiate while loop
high_prob = [1]
# Loop will run until there are no more high-probability pseudo-labels
while len(high_prob) > 0:
# Set the vector transformer (from data train)
columnTransformer = ColumnTransformer([
('tfidf',TfidfVectorizer(stop_words=None, max_features=100000),
'Content')
],remainder='drop')
def transforms(series):
before_vect = pd.DataFrame({'Content':series})
vector_transformer = columnTransformer.fit(pd.DataFrame({'Content':X_train}))
return vector_transformer.transform(before_vect)
X_train_df = transforms(X_train);
X_test_df = transforms(X_test);
X_unlabeled_df = transforms(X_unlabeled)
# Fit classifier and make train/test predictions
nb = MultinomialNB()
nb.fit(X_train_df, y_train)
y_hat_train = nb.predict(X_train_df)
y_hat_test = nb.predict(X_test_df)
# Calculate and print iteration # and f1 scores, and store f1 scores
train_f1 = f1_score(y_train, y_hat_train)
test_f1 = f1_score(y_test, y_hat_test)
print(f"Iteration {iterations}")
print(f"Train f1: {train_f1}")
print(f"Test f1: {test_f1}")
train_f1s.append(train_f1)
test_f1s.append(test_f1)
# Generate predictions and probabilities for unlabeled data
print(f"Now predicting labels for unlabeled data...")
pred_probs = nb.predict_proba(X_unlabeled_df)
preds = nb.predict(X_unlabeled_df)
prob_0 = pred_probs[:,0]
prob_1 = pred_probs[:,1]
# Store predictions and probabilities in dataframe
df_pred_prob = pd.DataFrame([])
df_pred_prob['preds'] = preds
df_pred_prob['prob_0'] = prob_0
df_pred_prob['prob_1'] = prob_1
df_pred_prob.index = X_unlabeled.index
# Separate predictions with > 99% probability
high_prob = pd.concat([df_pred_prob.loc[df_pred_prob['prob_0'] > 0.99],
df_pred_prob.loc[df_pred_prob['prob_1'] > 0.99]],
axis=0)
print(f"{len(high_prob)} high-probability predictions added to training data.")
pseudo_labels.append(len(high_prob))
# Add pseudo-labeled data to training data
X_train = pd.concat([X_train, X_unlabeled.loc[high_prob.index]], axis=0)
y_train = pd.concat([y_train, high_prob.preds])
# Drop pseudo-labeled instances from unlabeled data
X_unlabeled = X_unlabeled.drop(index=high_prob.index)
print(f"{len(X_unlabeled)} unlabeled instances remaining.\n")
# Update iteration counter
iterations += 1
I think I'm doing something wrong.. Because when I see the f1 scores it is decreasing. Please help me guys :'( I'm stressed.
f1 scores image
=================EDIT=================
So I've search on journal, then I think that I've got misunderstanding about the concept of data splitting in pseudo-labelling.
I initially thought that, the steps starts from splitting the data into labeled and unlabeled data, then from that labeled data, it was splitted into train and test.
But after surfing and searching, I found in this journal that my steps is incorrect. This journal says that the steps pseudo-labeling should start from splitting the data into train and test sets at first, and then from that train sets, data is splited to labeled and unlabeled datasets.
According to that journal, it reach the best result when splitting data into 90% of train sets and 10% of test sets. Then, from that 90% train set, it is splitted into 20% labeled data and 80% unlabeled data sets. This journal trying evidence range from 0.7 till 0.9 as boundary to drop the pseudo labeling, and on that proportion of splitting, the best evidence threshold value is 0.74. So I fix my steps with that new proportion and 0.74 threshold, and I finally got the F1 scores is increasing. Here are my steps:
Split data 0.9 train, 0.1 test sets (I labeled the test sets, so I can measure the f1 scores)
From 0.9 train, I split: 0.2 labeled, and 0.8 unlabeled data
Making vectorizer for X value of labeled train, test and unlabeled training datasets
Build first model from labeled train, then labeling the unlabeled training datasets. Then measure the F-1 scores according to the test sets (that already labeled).
Concatting train labeled data with prediction of unlabeled that have probability > 0.74 (threshold based on journal). We call this new data as pseudo-labelled, likened to the actual label), and make the second model from new train data sets.
Remove selected pseudo-labelled from unlabeled datasets
Use the second model to predict the remaining of unlabeled data, then iterate step 3 until there are no probability of predicted pseudo-labelled>0.74
So the last model is the final.
My syntax is still the same, I just changing the split proportion and I finally got my f1 scores increasing through 4 iterations: my new f1 scores.
Am I doing something right? Thank you for all of your attention guys.. So much thank you..

I'm not good at machine learning.
Overall I would say that you are quite good at Machine Learning: semi-supervised learning is an advanced type of problem and I think your solution is quite good. At least the general principle seems correct, but it's difficult to say for sure (I don't have time to analyze the code in detail sorry). A few comments:
One thing which might be improvable is the 0.74 threshold: this value certainly depends on the data, so you could do your own experiment by trying different threshold values and selecting the one which works best with your data.
Preferably it would be better to keep a final test set aside and use a separate validation set during the iterations. This would avoid the risk of data leakage.
I'm not sure about the stop condition for the loop. It might be ok but it might be worth trying other options:
Simply iterate a fixed number of times (for instance 10 times).
The stop condition could be based on "no more F1-score improvement" (i.e. stabilization of the performance), but it's a bit more advanced.
It's pretty good anyway, my comments are just ideas if you want to improve further. Note that it's been a long time since I've work with semi-supervised, I'm not sure I remember everything very well ;)

Pyspark with sklearn and dataframe types conversion

I'm trying to use sklearn with pyspark but I'm having some performance issues. Lets say I have a dataset that have already went through a pipeline where features were vectorized and normalized. In order to use sklearn I'll either have to provide the algorithm an array or a pandas dataframe. Considering the size of my dataset (2.8M + and should grow bigger soon), converting the train set to pandas is painstakingly slow so I'm using a numpy approach:
train = np.array(train.select('features').collect()).squeeze()
This is relatively slow as I need to use collect to push data back to the driver. Is there any other approach that would be faster and better? Additionally, due to the nature of the problem I'm currently handling my scoring function in a non-standard way:
def score (fitmodel, test):
predY = fitmodel.score_samples(test)
return np.full((1, len(predY)), np.mean(predY)).transpose()
The idea is to compute the average of the predictions and then return an array where this result is replicated as many times as the number of records tested. E.g. if my testset has 450 records I'll return an array shaped (450,1) where all 450 records have the same value (average of the predictions). Although slow, so far so good and everything works as intended. My problem is that I need to proceed with the testing by doing this multiple times (changing the test set) and append the results to a single array in order to evaluate the model's performance later. My code:
for _ in tdqm(range(450, 800)):
#Get group #X
_test = df.where(col('index') == _) #Get a different "chunk" of the dataset each iteration
_test.coalesce(2)
#Apply pipeline with transforms
test = pipelineModel.transform(_test)
y_test = np.array(test.select('label').collect())
x_test = np.array(test.select('features').collect()).squeeze()
pred = newScore(x_test, model)
if(_ == (450)): #First run
trueY = y_test
predY = pred
else:
trueY = np.append(trueY, y_test)
predY = np.append(predY, pred)
Briefly, I grab a specific portion of the dataset, test it and then i want to append both the predictions and "true" label for later evaluation. My main problem here is that doing 2 collects and a np.append takes a long time and I need to find an alternative. Testing the entire test set (about 400k entries) takes me ~1min but with all these in place it increases the time to 2h20min.
On top of this I have to convert the array back to a pyspark dataframe to use the mllib evaluation functions which adds a bit more time to the process.
With all this being said, could anyone point me in a direction where I could accomplish this in a more efficient way? Maybe there is another way to use spark and sklearn ?

How to train a CNN on an unlabeled dataset?

I want to train a CNN on my unlabeled data, and from what I read on Keras/Kaggle/TF documentation or Reddit threads, it looks like I will have to label my dataset beforehand. Is there a way to train the CNN in an unsupervised way?
I cannot understand how to initialize y_train and y_test (where y_train and y_test represent usual meanings)
The information about my dataset is as follows:
I have 50,000 matrices of dimension 30 x 30.
Each matrix is divided into 9 subareas (for understanding, as separated by the vertical and horizontal bars).
A subarea is said to be active if it has at least one element equal to 1. If all elements for that subarea are equal to 0, the subarea is inactive.
For the first example shown below, I should get as output the names of subareas that are active, so here, (1, 4, 5, 6, 7, 9).
If no subarea is active, as in the second example, the output should be 0.
First example: Output - (1, 4, 5, 6, 7, 9)
Second example: Output - 0
After creating these matrices, I did the following:
I put these matrices in a CSV file after reshaping them into vectors of dimension 900 x 1.
So basically, each row in the CSV contains 900 columns with values either 0 or 1.
The classes for my classification problem are numbers from 0-9 where 0 represents the class where no label has an active (value=1) value.
For my model, I want the following:
Input: a 900 x 1 vector as described above.
Output: one of the values from 0-9, where 1-9 represent the active subareas, and 0 represents no active subarea.
What I have done:
I am able to retrieve the data from the CSV file into a data frame and split the data frame into x_train and x_test. But I am unable to understand how to set my y_train and y_test values.
My problem seems very similar to the MNIST dataset, except I don't have the labels. Would it be possible for me to train the model without the labels?
My code currently looks like this:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Read the dataset from the CSV file into a dataframe
df = pd.read_csv("bci_dataset.csv")
# Split the dataframe into training and test dataset
train, test = train_test_split(df, test_size=0.2)
x_train = train.iloc[:, :]
x_test = test.iloc[:, :]
print(x_train.shape)
print(x_test.shape)
Thank you, in advance, for reading this whole thing and helping me out!

Can you tell us why you want to use a CNN specifically? Generally neural networks are used when there's some complication involved in going from feature to output - the artificial neurons are able to learn different behavior as a result of being exposed to the ground truth (i.e., the labels). Most of the time, the researcher using the neural network doesn't even know what features of the input data are being used by the network to come to its output conclusions.
In the case you have given us, it looks a little bit more like you know what features are important (that is, the sum of a subarea has to be greater than 0 in order to be active). The neural network wouldn't need to really learn anything in particular to do its job. Although it doesn't seem necessary to use a neural network for this process, it does make sense for you to automate it, given the size of your input data! :)
Let me know if I'm misunderstanding your situation, though?
Edit: To contrast this with the MNIST dataset - so for identifying handwritten digits, there's some ambiguity that the network has to learn to deal with. Not every kind of handwriting is going to render a 7 the same way. A neural network is able to figure out a couple of the features of a 7 (i.e., there is a high probability that a 7 will have a diagonal line going from top-right-to-bottom-left, which, depending on how you write, could be slightly curved or offset or whatever), as well as a couple of different versions of a 7 (some people do a horizontal slash through the middle of it, other versions of a 7 don't have that slash). The utility of a neural network here is in figuring out all that ambiguity and probabilistically classifying an input as a 7 (because it has seen previous images that it "knows" are 7s). However, in your case, there's only one way for your answer to be rendered - if there's any element greater than 0 in a subarea, it's active! So you don't need to train a network to do anything - you will just need to write some code that automates the summing of the subareas.

SKlearn prediction on test dataset with different shape from training dataset shape

I'm new to ML and would be grateful for any assistance provided. I've run a linear regression prediction using test set A and training set A. I saved the linear regression model and would now like to use the same model to predict a test set A target using features from test set B. Each time I run the model it throws up the error below
How can I successfully predict a test data set from features and a target with different shapes?
Input
print(testB.shape)
print(testA.shape)
Output
(2480, 5)
(1315, 6)
Input
saved_model = joblib.load(filename)
testB_result = saved_model.score(testB_features, testA_target)
print(testB_result)
Output
ValueError: Found input variables with inconsistent numbers of samples: [1315, 2480]
Thanks again

They are inconsistent shapes which is why the error is being thrown. Have you tried to reshape the data so one of them are same shape? From a quick look, it seems that you have more samples and one less feature in testA.
Think about it, if you have trained your model with 5 features you cannot then ask the same model to make a prediction given 6 features. You speak of using a Linear Regressor, the equation is roughly:
y = b + w0*x0 + w1*x1 + w2*x2 + .. + wN-1*xN-1
Where {
y is your output/label
N is the number of features
b is the bias term
w(i) is the ith weight
x(i) is the ith feature value
}
You have trained a linear regressor with 5 features, effectively producing the following
y (your output/label) = b + w0*x0 + w1*x1 + w2*x2 + w3*x3 + w4*x4
You then ask it to make a prediction given 6 features but it only knows how to deal with 5.
Aside from that issue, you also have too many samples, testB has 2480 and testA has 1315. These need to match, as the model wants to make 2480 predictions, but you only give it 1315 outputs to compare it to. How can you get a score for 1165 missing samples? Do you now see why the data has to be reshaped?
EDIT
Assuming you have datasets with an equal amount of features as discussed above, you may now look at reshaping (removing data) testB like so:
testB = testB[0:1314, :]
testB.shape
(1315, 5)
Or, if you would prefer a solution using the numpy API:
testB = np.delete(testB, np.s_[0:(len(testB)-len(testA))], axis=0)
testB.shape
(1315, 5)
Keep in mind, when doing this you slice out a number of samples. If this is important to you (which it can be) then it may be better to introduce a pre-processing step to help out with the missing values, namely imputing them like this. It is worth noting that the data you are reshaping should be shuffled (unless it is already), as you may be removing parts of the data the model should be learning about. Neglecting to do this could result in a model that may not generalise as well as you hoped.

how to set the number of features to use in random selection sklearn

I am using sklearn RandomForest Classifier/Bag classifier for learning and I am not getting the expected results when compared to Java/Weka Machine Learning library.
In Weka, I am learning the model with - Random forest of 10 trees, each constructed while considering 6 random features. (setNumFeatures need to be set and default is 10 trees)
In sklearn - I am not sure how to specify the number of features to randomly consider while constructing a random forest of 10 trees. This what I am doing:
rf_classifier = RandomForestClassifier(n_estimators=num_trees, max_features=6)
rf_classifier = rf_classifier.fit(train_file, train_file_label)
for items in rf_classifier.estimators_:
classifier_list.append(items)
I saw the docs and there is a parameter - max_features but I am not sure if that serves the purpose. I get this error when I am trying to calculate entropy:
# code to calculate voting entropy for all features (unlabeled data)
vote_count_for_features = list(classifier_list[0].predict(feature_data_arr))
for i in range(1, len(classifier_list)):
res_temp = []
res_temp = list(classifier_list[i].predict(feature_data_arr))
vote_count_for_features = [x + y for x, y in zip(vote_count_for_features, res_temp)]
If I set that parameter to 6, than my code fails with the error message:
Number of features of the model must match the input. Model n_features
is 6 and input n_features is 31
Inputs: Sample set of 1 million records with 31 features. When I run weka, the number of rules extracted are around 1000 whereas when I run the same thing through sklearn - I get hardly 70 rules.
I am new to python and sklearn and I am trying to figure out where am I doing wrong. (Weka code has been tested well and gives 95% precision, 80% recall - so I am assuming that's good)
Note: I have used sklearn imputer to impute missing values using 'mean' strategy whereas Weka has ways to handle NaN.
This is what I am trying to achieve: Learn Random Forest on a sample set, extract rules, evaluate rules and then apply on the bigger set
Any suggestions or input will really help me debug through the issue and solve it quickly.

I think the issue is that the individual trees get confused since they only use 6 features, but you give them 31. You can try to get the prediction to work by setting check_input = False:
list(classifier_list[i].predict(feature_data_arr, check_input = False))

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.