10-fold cross validation in Python

There is a deep learning model based on transfer learning and LSTM in this article; the author used 10-fold cross validation (as explained in Table 3) and took the average of the results.
I am familiar with 10-fold cross validation, as we need to divide the data and pass it to the model, but in this code (here) I can't figure out how to partition the data and pass it in.
There are two train/test/dev datasets (one for emotion analysis and one for sentiment analysis; we use both for transfer learning, but my focus is on emotion analysis). The raw data is in a couple of txt files, and after running, the model produces two new txt files, one with the predicted labels and one with the true labels.
These are the relevant lines in the main file:
model = BiLstm(args, data, ckpt_path='./' + args.data_name + '_output/')
if args.mode == 'train':
    model.train(data)
    sess = model.restore_last_session()
    model.predict(data, sess)
if args.mode == 'test':
    sess = model.restore_last_session()
    model.predict(data, sess)
in which 'data' is an instance of the Data class (code below) that holds the test/train/dev datasets. I think this is where I need to pass the partitioned data. If that is right, how can I do the partitioning and perform 10-fold cross validation?
data = Data('./data/'+args.data_name+'data_sample.bin',
            './data/'+args.data_name+'vocab_sample.bin',
            './data/'+args.data_name+'word_embed_weight_sample.bin',
            args.batch_size)
class Data(object):
    def __init__(self, data_path, vocab_path, pretrained, batch_size):
        self.batch_size = batch_size
        data, vocab, pretrained = self.load_vocab_data(data_path, vocab_path, pretrained)
        self.train = data['train']
        self.valid = data['valid']
        self.test = data['test']
        self.train2 = data['train2']
        self.valid2 = data['valid2']
        self.test2 = data['test2']
        self.word_size = len(vocab['word2id']) + 1
        self.max_sent_len = vocab['max_sent_len']
        self.max_topic_len = vocab['max_topic_len']
        self.word2id = vocab['word2id']
        word2id = vocab['word2id']
        # self.id2word = dict((v, k) for k, v in word2id.iteritems())
        self.id2word = {}
        for k, v in six.iteritems(word2id):
            self.id2word[v] = k
        self.pretrained = pretrained

By the look of it, the train method can receive a session and continue training from an existing model: def train(self, data, sess=None).
So with very minimal changes to the existing code and libraries you can do something like the following.
First load all the data and build the model:
data = Data('./data/'+args.data_name+'data_sample.bin',
            './data/'+args.data_name+'vocab_sample.bin',
            './data/'+args.data_name+'word_embed_weight_sample.bin',
            args.batch_size)
model = BiLstm(args, data, ckpt_path='./' + args.data_name + '_output/')
Then create the cross-validation datasets, something like:
def get_new_data_object():
    return Data('./data/'+args.data_name+'data_sample.bin',
                './data/'+args.data_name+'vocab_sample.bin',
                './data/'+args.data_name+'word_embed_weight_sample.bin',
                args.batch_size)

cross_validation = []
for i in range(10):
    tmp_data = get_new_data_object()
    tmp_data.train  = ...  # take the i-th 90% slice of tmp_data.train
    tmp_data.valid  = ...  # take the i-th 90% slice of tmp_data.valid
    tmp_data.test   = ...  # take the i-th 90% slice of tmp_data.test
    tmp_data.train2 = ...  # take the i-th 90% slice of tmp_data.train2
    tmp_data.valid2 = ...  # take the i-th 90% slice of tmp_data.valid2
    tmp_data.test2  = ...  # take the i-th 90% slice of tmp_data.test2
    cross_validation.append(tmp_data)
Then run the model n times (10 for 10-fold cross validation):
sess = None
for data in cross_validation:
    model.train(data, sess)
    sess = model.restore_last_session()
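For the "90% slice" placeholders above, one possible sketch (my own suggestion, not the author's code) is to use sklearn's KFold, assuming each split such as data.train is an indexable list or array of examples:

import numpy as np
from sklearn.model_selection import KFold

def fold_slices(examples, n_splits=10, seed=42):
    """For each fold, return the ~90% of `examples` used for training
    in that fold (the remaining ~10% is held out)."""
    examples = np.asarray(examples, dtype=object)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return [examples[train_idx].tolist() for train_idx, _ in kf.split(examples)]

# Hypothetical usage: apply the same fold index i to every split so the
# partitioning stays consistent across train/valid/test.
train_folds = fold_slices(data.train)
valid_folds = fold_slices(data.valid)
test_folds  = fold_slices(data.test)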
Keep in mind to pay attention to some key ideas:
I don't know exactly how your data is structured, but that affects the way of splitting it into test, train and (in your case) valid sets.
The split has to be the same for each triple of test, train and valid; it can be done randomly or by taking a different part every time, as long as it is consistent.
You can train the model n times with cross validation, or create n models and pick the best, to avoid overfitting.
This code is just a draft; you can implement it however you like. There are some great libraries that already implement such functionality, and of course it can be optimized (for example, by not reading the whole data files each time).
One more consideration is to separate the model creation from the data, especially the data argument of the model constructor; from a quick look it seems to use only the dimensions of the data, so it is good practice not to pass the whole object.
Moreover, if the model integrates other properties of the data object into its state when it is created (like the data itself), my code might not work and a more surgical approach would be needed.
Hope this helps and points you in the right direction.

Related

Batching large input file into MLlib model

Is there any way to batch a large input file (111 MB) made of 22 million cells (222 rows by 110k columns) in MLlib, something similar to this Keras batching tutorial?
The file contains the actual features extracted from 222 images using the above tutorial, but instead of using a Keras model I would like to replicate that code using PySpark and MLlib.
Unfortunately I don't have enough resources to handle such a big file in memory, and the computation fails with a Java heap space memory error.
The file structure is as follows: each row represents an image, "_c0" is the 0/1 label, and "_c1" up to "_c100353" are the extracted features.
Here's my code. I don't care about precision and accuracy; I'm just interested in running the model to collect resource usage metrics.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler, VectorIndexer
from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

sql, sc = init_spark()
df = sql.read.option("maxColumns", 100400).load(file3, format="csv", inferSchema="true", sep=',', header="false")
labelIndexer = StringIndexer(inputCol="_c0", outputCol="indexedLabel").fit(df)
cols = df.columns
cols.remove("_c0")
assembler = VectorAssembler(inputCols=cols, outputCol="features")
data = assembler.transform(df)
featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=100).fit(data)
(trainingData, testData) = data.randomSplit([0.7, 0.3])
dt = DecisionTreeClassifier(labelCol="indexedLabel", featuresCol="indexedFeatures")

# Chain indexers and tree in a Pipeline
pipeline = Pipeline(stages=[labelIndexer, featureIndexer, dt])

# Train model. This also runs the indexers.
model = pipeline.fit(trainingData)

# Make predictions.
predictions = model.transform(testData)

# Select example rows to display.
predictions.select("prediction", "indexedLabel", "features").show(100)
predictions.printSchema()

evaluator = MulticlassClassificationEvaluator(
    labelCol="indexedLabel", predictionCol="prediction", metricName="accuracy")
accuracy = evaluator.evaluate(predictions)
print("Accuracy = %g " % accuracy)
Please don't suggest using the sparkdl library for feature extraction with DeepImageFeaturizer, because it's completely broken.

Have I made a mistake in the for loop in my Python code? Model accuracy is too high, so I'm double checking

I am building a quant model that takes a bunch of features and predicts the performance of an index. The model is doing exceptionally well, which obviously makes me wonder if I am making some mistake.
I have looked at the underlying features that I am using to ensure there is no data leakage. So now my attention is turning to my code. Below is the main body of code that I use for prediction.
Does anything look wrong in the looping or in how I am predicting? Please let me know if you need any more information and I will share what I can.
X -> Features used in model training and prediction
y -> Class variable (1, 0)
n_records -> Number of records in the dataset
n_train -> Amount of data to use for training in the rolling window construct
model -> Ensemble model from sklearn
My training data is roughly 4,500 records. I used an n_train of 800 to train the first instance of the model, and then a rolling window of 800 records for training to predict the 801st instance (and so on). In that way I roll through time, leaving out very old data (keeping the model "current").
col_names = ['Pred', 'Actual', 'Pred Prob']  # Column names for prediction output dataframe

def Strategy(n_train):
    list_ans = []
    n_records = len(X)  # Number of records in X
    for i in range(n_train, n_records):
        # Rolling window: train on the previous n_train records and predict tomorrow's performance
        X_train, X_test, y_train, y_test = X[i-n_train:i], X[i:i+1], y[i-n_train:i], y[i:i+1]
        X_train = ss.fit_transform(X_train)
        X_test = ss.transform(X_test)
        model.fit(X_train, y_train)
        Pred = model.predict(X_test)
        Actual = y_test.values
        Prob = model.predict_proba(X_test)[:, 1]
        i_ans = [Pred.item(), Actual.item(), Prob.item()]
        resi = pd.Series(data=i_ans, index=col_names)
        list_ans.append(resi)
    return pd.DataFrame(list_ans)
For 1: what values do you expect for n_records and n_train? Keep in mind that n_train acts as the minimum value of the range. I don't know if this is how it should be, but be careful, you may be skipping training data.
Apart from that, it looks good to my eyes!
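To illustrate that point, a purely illustrative sketch with the numbers from the question (n_train = 800, ~4,500 records): the loop's first iteration predicts the record at index n_train, so the earliest records only ever appear inside training windows and are never predicted.

n_records = 4500                        # hypothetical dataset size from the question
n_train = 800                           # rolling window length from the question

first_predicted = n_train               # index 800 is the first out-of-sample prediction
never_predicted = list(range(n_train))  # indices 0..799 are only ever used for training
n_predictions = n_records - n_train     # 3700 predictions in total
print(first_predicted, len(never_predicted), n_predictions)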

Pseudo Labelling on Text Classification Python

I'm not good at machine learning. Can someone tell me how to do text classification with pseudo-labeling in Python? I never found the right implementation; I have searched everywhere on the internet, but gave up after finding nothing. I only found implementations for numeric datasets, and none for text classification (vectorized text). So I wrote this code, but I don't know whether it is correct or not. Am I doing it wrong? Please help me, I really need your help.
This is my dataset if you want to try it. I want to classify 'Label' from 'Content'.
My steps are:
Split the data 0.75 unlabeled, 0.25 labeled.
From the 0.25 labeled, split 0.75 train labeled and 0.25 test labeled.
Make a vectorizer for the train, test and unlabeled datasets.
Build the first model from the labeled train data, then label the unlabeled dataset.
Concatenate the labeled train data with the predictions on unlabeled data that have probability > 0.99 (pseudo-labeled), and build the second model.
Remove the pseudo-labeled rows from the unlabeled dataset.
Predict the remaining unlabeled data with the second model, then iterate from step 3 until no predicted pseudo-label has probability > 0.99.
This is my code (performing pseudo-labelling on text classification):
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.naive_bayes import MultinomialNB

# Initiate iteration counter
iterations = 0
# Containers to hold f1_scores and # of pseudo-labels
train_f1s = []
test_f1s = []
pseudo_labels = []
# Assign value to initiate while loop
high_prob = [1]
# Loop will run until there are no more high-probability pseudo-labels
while len(high_prob) > 0:
    # Set the vector transformer (fitted on the train data)
    columnTransformer = ColumnTransformer([
        ('tfidf', TfidfVectorizer(stop_words=None, max_features=100000), 'Content')
    ], remainder='drop')

    def transforms(series):
        before_vect = pd.DataFrame({'Content': series})
        vector_transformer = columnTransformer.fit(pd.DataFrame({'Content': X_train}))
        return vector_transformer.transform(before_vect)

    X_train_df = transforms(X_train)
    X_test_df = transforms(X_test)
    X_unlabeled_df = transforms(X_unlabeled)
    # Fit classifier and make train/test predictions
    nb = MultinomialNB()
    nb.fit(X_train_df, y_train)
    y_hat_train = nb.predict(X_train_df)
    y_hat_test = nb.predict(X_test_df)
    # Calculate and print iteration # and f1 scores, and store f1 scores
    train_f1 = f1_score(y_train, y_hat_train)
    test_f1 = f1_score(y_test, y_hat_test)
    print(f"Iteration {iterations}")
    print(f"Train f1: {train_f1}")
    print(f"Test f1: {test_f1}")
    train_f1s.append(train_f1)
    test_f1s.append(test_f1)
    # Generate predictions and probabilities for unlabeled data
    print("Now predicting labels for unlabeled data...")
    pred_probs = nb.predict_proba(X_unlabeled_df)
    preds = nb.predict(X_unlabeled_df)
    prob_0 = pred_probs[:, 0]
    prob_1 = pred_probs[:, 1]
    # Store predictions and probabilities in a dataframe
    df_pred_prob = pd.DataFrame([])
    df_pred_prob['preds'] = preds
    df_pred_prob['prob_0'] = prob_0
    df_pred_prob['prob_1'] = prob_1
    df_pred_prob.index = X_unlabeled.index
    # Separate predictions with > 99% probability
    high_prob = pd.concat([df_pred_prob.loc[df_pred_prob['prob_0'] > 0.99],
                           df_pred_prob.loc[df_pred_prob['prob_1'] > 0.99]],
                          axis=0)
    print(f"{len(high_prob)} high-probability predictions added to training data.")
    pseudo_labels.append(len(high_prob))
    # Add pseudo-labeled data to training data
    X_train = pd.concat([X_train, X_unlabeled.loc[high_prob.index]], axis=0)
    y_train = pd.concat([y_train, high_prob.preds])
    # Drop pseudo-labeled instances from unlabeled data
    X_unlabeled = X_unlabeled.drop(index=high_prob.index)
    print(f"{len(X_unlabeled)} unlabeled instances remaining.\n")
    # Update iteration counter
    iterations += 1
I think I'm doing something wrong, because the f1 scores keep decreasing (see the f1 scores image). Please help me, I'm stressed.
=================EDIT=================
So I searched the literature, and I think I had a misunderstanding of how data splitting works in pseudo-labelling.
I initially thought that the steps start by splitting the data into labeled and unlabeled data, and then that labeled data is split into train and test.
But after more searching, I found in this journal that my steps were incorrect. The journal says that pseudo-labeling should start by splitting the data into train and test sets first, and then the train set is split into labeled and unlabeled datasets.
According to the journal, the best result is reached when splitting the data into 90% train and 10% test sets. Then, the 90% train set is split into 20% labeled and 80% unlabeled data. The journal tries confidence thresholds from 0.7 to 0.9 as the boundary for accepting pseudo-labels, and with that split proportion the best threshold value is 0.74. So I fixed my steps with the new proportions and the 0.74 threshold, and my F1 scores finally increase. Here are my steps:
Split the data into 0.9 train and 0.1 test sets (the test set keeps its labels, so I can measure the F1 scores).
From the 0.9 train set, split 0.2 labeled and 0.8 unlabeled data.
Make a vectorizer for the X values of the labeled train, test and unlabeled training datasets.
Build the first model from the labeled train data, then label the unlabeled training data. Then measure the F1 scores against the (already labeled) test set.
Concatenate the labeled train data with the predictions on unlabeled data that have probability > 0.74 (the threshold from the journal). We call this new data pseudo-labeled, treated as if it were the actual label, and build the second model from the new train set.
Remove the selected pseudo-labeled rows from the unlabeled dataset.
Use the second model to predict the remaining unlabeled data, then iterate from step 3 until no predicted pseudo-label has probability > 0.74.
The last model is the final one.
My code is still the same; I just changed the split proportions, and my f1 scores now increase through 4 iterations: my new f1 scores.
Am I doing something right? Thank you all for your attention.
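For reference, a minimal sketch of the split proportions described in the edit (my own illustration; df, 'Content' and 'Label' follow the dataset description above and are not the asker's actual variable names):

from sklearn.model_selection import train_test_split

# 90% train / 10% test (the test set keeps its labels so F1 can be measured)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    df['Content'], df['Label'], test_size=0.10, random_state=42)

# From the 90% train portion: 20% labeled / 80% treated as unlabeled
X_train, X_unlabeled, y_train, y_unlabeled_hidden = train_test_split(
    X_train_full, y_train_full, train_size=0.20, random_state=42)

# y_unlabeled_hidden is set aside; the pseudo-labelling loop above only sees
# X_unlabeled and fills in labels for predictions with probability > 0.74.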
"I'm not good at machine learning."
Overall I would say that you are quite good at machine learning: semi-supervised learning is an advanced type of problem, and I think your solution is quite good. At least the general principle seems correct, but it's difficult to say for sure (I don't have time to analyze the code in detail, sorry). A few comments:
One thing which might be improvable is the 0.74 threshold: this value certainly depends on the data, so you could run your own experiment by trying different threshold values and selecting the one which works best with your data.
Preferably it would be better to keep a final test set aside and use a separate validation set during the iterations. This would avoid the risk of data leakage.
I'm not sure about the stop condition for the loop. It might be OK, but it might be worth trying other options:
Simply iterate a fixed number of times (for instance 10 times).
The stop condition could be based on "no more F1-score improvement" (i.e. stabilization of the performance), but it's a bit more advanced.
It's pretty good anyway; my comments are just ideas if you want to improve further. Note that it's been a long time since I've worked with semi-supervised learning, so I'm not sure I remember everything very well ;)
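A possible sketch of the threshold experiment mentioned in the first comment (entirely hypothetical: run_pseudo_labelling is assumed to wrap the loop above, take a threshold, and return the final test F1):

def run_pseudo_labelling(threshold):
    # Hypothetical wrapper around the asker's loop, using `threshold`
    # instead of the hard-coded 0.74, returning the final test F1 score.
    ...

results = {}
for threshold in [0.70, 0.74, 0.80, 0.85, 0.90, 0.95, 0.99]:
    results[threshold] = run_pseudo_labelling(threshold)

best_threshold = max(results, key=results.get)
print(f"Best threshold: {best_threshold} (test F1 = {results[best_threshold]:.3f})")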

LightFM train_interactions shared among train and test sets: This will cause incorrect evaluation, check your data split

tl;dr: I am working with the Yelp Dataset to make a recommendation system, but I run into the error Test interactions matrix and train interactions matrix share 68 interactions. This will cause incorrect evaluation, check your data split. when running the following LightFM code:
test_auc = auc_score(model,
                     test,
                     #train_interactions=train,  # Unable to run with this line uncommented
                     item_features=sparse_features_matrix,
                     num_threads=NUM_THREADS).mean()
print('Hybrid test set AUC: %s' % test_auc)
Full story: I am working with the Yelp Dataset to build a recommendation system, going off the code provided in the example documentation (https://making.lyst.com/lightfm/docs/examples/hybrid_crossvalidated.html) for hybrid collaborative filtering.
I ran my code the following way:
from sklearn.model_selection import train_test_split
from lightfm import LightFM
from scipy import sparse
from lightfm.evaluation import auc_score

train, test = train_test_split(sparse_Rating_Matrix, test_size=0.25, random_state=4)

# Set the number of threads; you can increase this
# if you have more physical cores available.
NUM_THREADS = 2
NUM_COMPONENTS = 100
NUM_EPOCHS = 3
ITEM_ALPHA = 1e-6

# Define a new model instance
model = LightFM(loss='warp',
                item_alpha=ITEM_ALPHA,
                no_components=NUM_COMPONENTS)

# Fit the hybrid model. Note that this time, we pass
# in the item features matrix.
model = model.fit(train,
                  item_features=sparse_features_matrix,
                  epochs=NUM_EPOCHS,
                  num_threads=NUM_THREADS)

# Don't forget to pass in the item features again!
train_auc = auc_score(model,
                      train,
                      item_features=sparse_features_matrix,
                      num_threads=NUM_THREADS).mean()
print('Hybrid training set AUC: %s' % train_auc)

test_auc = auc_score(model,
                     test,
                     #train_interactions=train,  # Unable to run with this line uncommented
                     item_features=sparse_features_matrix,
                     num_threads=NUM_THREADS).mean()
print('Hybrid test set AUC: %s' % test_auc)
I had two problems:
1) Running the line in question uncommented (train_interactions=train) originally yielded an inconsistent shape error, which was resolved as follows: the "test" dataset was modified by the block of code below to append rows of zeros to it until its dimensions match those of my train dataset (per this recommendation: https://github.com/lyst/lightfm/issues/369):
import numpy as np

# Add X users to Test so that the number of rows in Train matches Test
N = train.shape[0]              # Rows in Train set
n, m = test.shape               # Rows & columns in Test set
z = np.zeros([(N - n), m])      # Create the necessary rows of zeros with m columns
test = test.todense()           # Temporarily convert Test into a numpy array
test = np.vstack((test, z))     # Vertically stack Test on top of the blank users
test = sparse.csr_matrix(test)  # Convert back to sparse
2) After the shape issue was resolved, I tried to use train_interactions=train, but ran into Test interactions matrix and train interactions matrix share 68 interactions. This will cause incorrect evaluation, check your data split.
I'm not sure how to resolve this second issue. Any ideas?
Details:
- "sparse_features_matrix" is a sparse matrix of {items x categories}: if an item is "Italian" and "Pizza", then the "Italian" and "Pizza" categories have the value 1 in that item's row, and 0 elsewhere.
- "sparse_Rating_Matrix" is a sparse matrix of {users x items} containing the users' ratings of the restaurants (items).
04/08/2020 update:
LightFM has a whole Dataset() class that you should use to prep your data prior to model evaluation. I found a great GitHub post (https://github.com/lyst/lightfm/issues/494) where user Med-ELOMARI provides an amazing walkthrough on a small test dataset.
When I prepped my data through this method, I was able to add in the user_features that I wanted to model (e.g., User_1592 likes the "Thai", "Mexican" and "Sushi" cuisines).
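A rough sketch of that prep step (only an illustration; ratings_df, all_cuisine_tags and user_cuisine_map are assumed names for the Yelp data, not the actual variables used):

from lightfm.data import Dataset

dataset = Dataset()
dataset.fit(users=ratings_df['user_id'].unique(),
            items=ratings_df['business_id'].unique(),
            user_features=all_cuisine_tags)   # e.g. ['Thai', 'Mexican', 'Sushi', ...]

# Interactions are (user, item) pairs; an optional weight can be added as a third element
(lightfm_interactions, interaction_weights) = dataset.build_interactions(
    (row.user_id, row.business_id) for row in ratings_df.itertuples())

# User features are (user, [feature names]) pairs
lightfm_user_features_list = dataset.build_user_features(
    (user, cuisines) for user, cuisines in user_cuisine_map.items())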
Per Turbo's comment, I used LightFM's random_train_test_split method (I had originally split my data via sklearn's train_test_split method) and ran auc_score with the new train/test sets AND the (as far as I'm aware) correctly prepared model, but I still run into the same error:
Input:
%%time
(train, test) = random_train_test_split(lightfm_interactions, test_percentage=0.25)  # LightFM's method to split

# Don't forget to pass in the user features again!
train_auc = auc_score(model_users,
                      train,
                      user_features=lightfm_user_features_list,
                      num_threads=NUM_THREADS).mean()
print('User_feature training set AUC: %s' % train_auc)

test_auc = auc_score(model_users,
                     test,
                     #train_interactions=train,  # Still can't get this to function
                     user_features=lightfm_user_features_list,
                     num_threads=NUM_THREADS).mean()
print('User_feature test set AUC: %s' % test_auc)
Output if "train_interactions=train" is used:
ValueError: Test interactions matrix and train interactions matrix share 435 interactions. This will cause incorrect evaluation, check your data split.
The good news, however, is that by switching from sklearn's train_test_split to LightFM's random_train_test_split, my model's training AUC score went from 0.49 to 0.96. So I guess it's important to stick with LightFM's methods when available!
LightFM provides its own way of splitting your dataset; did you look at it?
With it, it might work.
https://making.lyst.com/lightfm/docs/cross_validation.html
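A minimal sketch of what that page describes (assuming sparse_Rating_Matrix is the full interactions matrix as a SciPy sparse matrix, and reusing names from the question):

import numpy as np
from lightfm.cross_validation import random_train_test_split
from lightfm.evaluation import auc_score

# Split individual (user, item) interactions rather than rows of users, so that
# train and test never share an interaction, which is what the error checks for.
train, test = random_train_test_split(sparse_Rating_Matrix,
                                      test_percentage=0.25,
                                      random_state=np.random.RandomState(4))

test_auc = auc_score(model,
                     test,
                     train_interactions=train,   # should now pass the overlap check
                     item_features=sparse_features_matrix,
                     num_threads=NUM_THREADS).mean()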

Random Forest Regression Accuracy different for Training set and Test set [closed]

I am new to machine learning and to Python. I am trying to build a Random Forest regression model on one of the datasets from the UCI repository. This is my first ML model, and I may be entirely wrong in my approach.
The dataset is available here - https://archive.ics.uci.edu/ml/datasets/abalone
Below is the entire working code that I have written. I am using Python 3.6.4 on Windows 7 x64 (forgive me for the lengthy code).
import tkinter as tk                                    # Required for enabling GUI options
from tkinter import messagebox                          # Required for pop-up window
from tkinter import filedialog                          # Required for getting full path of file
import pandas as pd                                     # Required for data handling
from sklearn.model_selection import train_test_split   # Required for splitting data into training and test set
from sklearn.ensemble import RandomForestRegressor     # Required to build random forest

#------------------------------------------------------------------------------#
# Create an instance of tkinter and hide the window
root = tk.Tk()                      # Create an instance of tkinter
root.withdraw()                     # Hides root window
#root.lift()                        # Required for pop-up window management
root.attributes("-topmost", True)   # To make pop-up window stay on top of all other windows

#------------------------------------------------------------------------------#
# This block of code reads the input file using tkinter GUI options
print("Reading input file...")
# Pop-up window to ask the user for the input file
File_Checker = messagebox.askokcancel("Random Forest Regression Prompt",
                                      "At The Prompt, Enter 'Abalone_Data.csv' File.")
# Kill the execution if the user selects "Cancel" in the above pop-up window
if (File_Checker == False):
    quit()
else:
    del(File_Checker)

file_loop = 0
while (file_loop == 0):
    # Get path of base file
    file_path = filedialog.askopenfilename(initialdir = "/",
                                           title = "File Selection Prompt",
                                           filetypes = (("XLSX Files","*.*"), ))
    # Condition to check if user selected a file or not
    if (len(file_path) < 1):
        # Pop-up window to warn user that no file was selected
        result = messagebox.askretrycancel("File Selection Prompt Error",
                                           "No file has been selected. \nWhat do you want to do?")
        # Condition to repeat the loop or quit program execution
        if (result == True):
            continue
        else:
            quit()
    # Get file name
    file_name = file_path.split("/")   # Splits the path with "/" as the delimiter and returns a list
    file_name = file_name[-1]          # Extracts the last element of the list
    # Condition to check if the correct file was selected or not
    if (file_name != "Abalone_Data.csv"):
        result = messagebox.askretrycancel("File Selection Prompt Error",
                                           "Incorrect file selected. \nWhat do you want to do?")
        # Condition to repeat the loop or quit program execution
        if (result == True):
            continue
        else:
            quit()
    # Read the base file
    input_file = pd.read_csv(file_path,
                             sep = ',',
                             encoding = 'utf-8',
                             low_memory = True)
    break

# Delete unwanted variables
del(file_loop, file_name)
#------------------------------------------------------------------------------------------------------------------------#
print("Preparing dependent and independent variables...")
# Create Separate dataframe consisting of only dependent variable
y = pd.DataFrame(input_file['Rings'])
# Create Separate dataframe consisting of only independent variable
X = input_file.drop(columns = ['Rings'], inplace = False, axis = 1)
#------------------------------------------------------------------------------------------------------------------------#
print("Handling Dummy Variable Trap...")
# Create a new dataframe to handle categorical data
# This method splits the dategorical data column into separate columns
# This is to ensure we get rid of the dummy variable trap
dummy_Sex = pd.get_dummies(X['Sex'], prefix = 'Sex', prefix_sep = '_', drop_first = True)
# Remove the speciic columns from the dataframe
# These are the categorical data columns which split into separae columns in the previous step
X.drop(columns = ['Sex'], inplace = True, axis = 1)
# Merge the new columns to the original dataframe
X = pd.concat([X, dummy_sex], axis = 1)
#------------------------------------------------------------------------------------------------------------------------#
y = y.values
X = X.values
#------------------------------------------------------------------------------------------------------------------------#
print("Splitting datasets to training and test sets...")
# Splitting the data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
#------------------------------------------------------------------------------------------------------------------------#
print("Fitting Random Forest regression on training set")
# Fitting the regression model to the dataset
regressor = RandomForestRegressor(n_estimators = 100, random_state = 50)
regressor.fit(X_train, y_train.ravel()) # Using ravel() to avoid getting 'DataConversionWarning' warning message
#------------------------------------------------------------------------------------------------------------------------#
print("Predicting Values")
# Predicting a new result with regression
y_pred = regressor.predict(X_test)
# Enter values for new prediction as a Dictionary
test_values = {'Sex_I' : 0,
'Sex_M' : 0,
'Length' : 0.5,
'Diameter' : 0.35,
'Height' : 0.8,
'Whole_Weight' : 0.223,
'Shucked_Weight' : 0.09,
'Viscera_Weight' : 0.05,
'Shell_Weight' : 0.07}
# Convert dictionary into dataframe
test_values = pd.DataFrame(test_values, index = [0])
# Rearranging columns as required
test_values = test_values[['Length','Diameter','Height','Whole_Weight','Shucked_Weight','Viscera_Weight',
'Viscera_Weight', 'Sex_I', 'Sex_M']]
# Applying feature scaling
#test_values = sc_X.transform(test_values)
# Predicting values of new data
new_pred = regressor.predict(test_values)
#------------------------------------------------------------------------------------------------------------------------#
"""
print("Building Confusion Matrix...")
# Making the confusion matrix
cm = confusion_matrix(y_test, y_pred)
"""
#------------------------------------------------------------------------------------------------------------------------#
print("\n")
print("Getting Model Accuracy...")
# Get regression details
#print("Estimated Coefficient = ", regressor.coef_)
#print("Estimated Intercept = ", regressor.intercept_)
print("Training Accuracy = ", regressor.score(X_train, y_train))
print("Test Accuracy = ", regressor.score(X_test, y_test))
print("\n")
print("Printing predicted result...")
print("Result_of_Treatment = ", new_pred)
When I look at the model accuracy, below is what I get.
Getting Model Accuracy...
Training Accuracy =  0.9359702279804791
Test Accuracy =  0.5695080680053354
Below are my questions.
1) Why are the training accuracy and test accuracy so far apart?
2) How do I know if this model is being over/under-fitted?
3) Is Random Forest regression the right model to use? If not, how do I determine the right model for this use case?
4) How can I build a confusion matrix using the variables I have created?
5) How do I validate the performance of the model?
I am looking for your guidance so that I too can learn from my mistakes and improve on my modelling skills.
Before trying to answer your points, a comment: I see you are using a regressor with accuracy as the metric. But accuracy is a metric used in classification problems; in regression models you usually use other metrics, such as the Mean Squared Error (MSE). See here.
If you just switch to a more suitable metric, maybe you will find that your model is not so bad.
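For instance, a quick check with regression metrics (a sketch, using sklearn.metrics on the y_test/y_pred already computed in the question's code):

from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# regressor.score() already reports R^2; MSE/MAE give the error in "rings" units,
# which is often easier to interpret for this dataset.
print("Test MSE = ", mean_squared_error(y_test, y_pred))
print("Test MAE = ", mean_absolute_error(y_test, y_pred))
print("Test R^2 = ", r2_score(y_test, y_pred))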
I'm going to reply to your questions anyway.
Why are the Training Accuracy and Test Accuracy so far away?
This means that you overfitted your training samples: your model is very strong at predicting the data in the training dataset, but unable to generalise. It is like having a model trained on a set of cat pictures which believes that only those pictures are cats, and that all the other pictures of all the other cats are not. In fact, your score on the test set is ~0.5, which is basically a random guess.
How do I know if this model is being over/under-fitted?
Exactly from the difference in accuracy between the two sets. The closer they are to each other, the better the model is able to generalise. You already know what overfitting looks like. Underfitting is generally recognisable by a low accuracy on both sets.
Is Random Forest regression the right model to use? If not, how do I determine the right model for this use case?
There is no single right model to use. Random Forest, and in general all the tree-based models (LightGBM, XGBoost), are the Swiss army knife of machine learning when you are dealing with structured data, because of their simplicity and reliability. Models based on deep learning perform better in theory, but are much more complex to set up.
How can I build a confusion matrix using the variables I have created?
Confusion matrices can be created when you build a classification model, and are built on the output of your model. You are using a regressor, so it does not make a lot of sense.
How do I validate the performance of the model?
In general, for a reliable validation of performance you split the data in three: you train on one (a.k.a. the training set), tune the model on a second (a.k.a. the validation set; this is what you call the test set), and finally, when you are happy with the model and its hyper-parameters, you test it on the third (a.k.a. the test set, not to be confused with the one you call the test set). This last one tells you whether your model generalizes well or not. This is because when you choose and tune the model you can also overfit the validation set (the one you call the test set), perhaps by selecting a set of hyper-parameters which performs well only on that set.
Also, you have to choose a reliable metric, and this depends both on the data and on the model. For regression, the MSE is pretty good.
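A small sketch of that three-way split (just one possible way, reusing sklearn's train_test_split twice on the X/y arrays from the question):

from sklearn.model_selection import train_test_split

# First carve out a held-out test set (20%), then split the rest into
# training (60% of the total) and validation (20% of the total).
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)

# Tune hyper-parameters against (X_val, y_val); only touch (X_test, y_test) at the very end.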
With trees and ensembles, you have to be careful with some settings. In your case, the difference comes from overfitting: your model has learned your training data "too well" and is not able to generalise to other data.
One important thing to do is to limit the depth of the trees. Every tree has a branching factor of 2, which means that at depth d you are going to have 2^d branches.
Let's imagine you have 1000 training values. If you don't limit the depth (and/or min_samples_leaf), you can learn your complete dataset with a depth of 10 (because 2^10 = 1024 > N_training).
What you can do is compare the training accuracy and test accuracy for a range of depths (from, let's say, 3 to log2(n)). If the depth is too low, both accuracies will be low, as you need more branches to learn the data properly; they will rise to a peak, after which the training score keeps rising but the test score goes down. It should look like the usual curve of score against model complexity, where model complexity here is your depth.
You can also play with min_samples_split and/or min_samples_leaf, which let a node split only if it has enough data in that branch. As a result, this also limits the depth and allows the tree to have a different depth per branch. As explained above, you can play with the values to look for the best ones (with a grid search).
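A sketch of that depth sweep on the train/test split from the question (the specific hyper-parameter values here are just examples):

from sklearn.ensemble import RandomForestRegressor

for depth in range(3, 13):   # roughly 3 .. log2(n) for a few thousand rows
    rf = RandomForestRegressor(n_estimators=100, max_depth=depth,
                               min_samples_leaf=5, random_state=50)
    rf.fit(X_train, y_train.ravel())
    print(f"depth={depth:2d}  train R^2={rf.score(X_train, y_train):.3f}"
          f"  test R^2={rf.score(X_test, y_test):.3f}")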
I hope it helps,
