Python SGDRegressor, different results of RMSLE

I'm trying to use SGDRegressor from scikit-learn for a simple linear regression problem, but my code gives a different RMSLE value on each run.
Why is that? Also, how can I get the lowest possible RMSLE?
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from math import sqrt
import matplotlib.pyplot as plt

# helper assumed by the question (not shown in the original):
# RMSLE is the RMSE of the log1p-transformed values
def rmsle(y_true, y_pred):
    return sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))

# load data
train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')

# split data
x_train = train.GrLivArea[:1000].values.reshape(-1, 1)
y_train = train.SalePrice[:1000].values.reshape(-1, 1)
x_train_normal = np.log(x_train)
y_train_normal = np.log(y_train)  # log-transform
x_test = train.GrLivArea[1000:].values.reshape(-1, 1)
y_test = train.SalePrice[1000:].values.reshape(-1, 1)
x_test_normal = np.log(x_test)
y_test_normal = np.log(y_test)  # log-transform
y_test_transform = np.exp(y_test_normal)

# n_iter was renamed to max_iter in newer scikit-learn versions
Model = linear_model.SGDRegressor(max_iter=int(np.ceil(10**7 / len(y_train_normal))))
Model.fit(x_train_normal, y_train_normal.ravel())
Sale_Prices_Predicted = Model.predict(x_test_normal)
Sale_Prices_Predicted_Transform = np.exp(Sale_Prices_Predicted)
rmsle_value = rmsle(y_test_transform, Sale_Prices_Predicted_Transform)
print("RMSLE: ", rmsle_value)
For example:
0.28153047299638045
0.28190513681658363
0.28207666380009233
0.28126007334118047

It is quite simple: the SGDRegressor is not initialized the same way each time. If you want reproducible results, you need to fix the seed.
Different random initialisations lead to slightly different results. This is a very common situation in machine learning.
This behavior can appear for any kind of model whose training involves randomness:
Neural Networks
Random Forests
SVM
etc.
SGDRegressor documentation: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html

random_state : int, RandomState instance or None, optional (default=None)
The seed of the pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
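
A minimal sketch (reusing the variables from the question) that pins the seed so every run prints the same RMSLE:

# fixing random_state makes the data shuffling, and hence the fit, reproducible;
# the value 42 is arbitrary
Model = linear_model.SGDRegressor(random_state=42)
Model.fit(x_train_normal, y_train_normal.ravel())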


Why different random_states in ML model?

I recently read that specifying a number for random_state ensures that you get the same results on each run.
Why, then, do I use random_state=1 when splitting the data into training and validation sets, but random_state=0 when creating the model?
I would have expected both to be the same value.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
data = pd.read_csv('../input/fifa-2018-match-statistics/FIFA 2018 Statistics.csv')
y = (data['Man of the Match'] == "Yes") # Convert from string "Yes"/"No" to binary
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
X = data[feature_names]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
my_model = RandomForestClassifier(n_estimators=100,
                                  random_state=0).fit(train_X, train_y)
Don't read too much into the number itself. Basically, random_state refers to NumPy's random number generator (numpy.random.seed()) and ensures that the random numbers you create are always seeded exactly the same; initializing with 1 gives you a different sequence than initializing with 0. The split uses the random numbers for a different purpose (partitioning the data) than the random forest does (introducing randomness into the trees, e.g. by selecting feature subsets per tree), so there is no reason the two seeds have to match.
The exact values matter little; they only ensure the reproducibility of your draws. To see this, set the seed, e.g. numpy.random.seed(42), and then draw several random numbers with numpy.random.rand(). Resetting the seed to 42 and repeating the draws will give you the same sequence.
From time to time (after setting everything up satisfactorily) it can be wise to drop the fixed random_state and see what your results look like over repeated runs with more randomness included. Trying other seed values (or no seed at all) gives you a sense of how independent and valid your results are in the end. If you need to compare and reproduce results exactly, the seed should be given.
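
A quick illustration of that seeding behaviour:

import numpy as np

np.random.seed(42)
first = np.random.rand(3)
np.random.seed(42)          # re-seeding with the same value...
second = np.random.rand(3)  # ...reproduces exactly the same draws
print(np.array_equal(first, second))  # True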

How to parallelize multiple model-building procedures in sklearn

Is there a way to parallelize multiple model-building procedures in scikit-learn? I know that I can use the n_jobs argument in both GridSearchCV and cross_validate to achieve some sort of parallelization within one model-building procedure. However, I am running multiple model-building procedures in a for-loop with different input parameters and saving the results in a list. Just as an example, suppose I have 15 free CPUs and I am using n_jobs=5 in cross_validate. If I am not mistaken, that means that a single model-building procedure uses 5 CPUs. Is there a way to start the next 2 model-building procedures in my for-loop right away, so that I am using all 15 CPUs? Here's a dummy example:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, GridSearchCV, cross_validate

# load breast cancer data set
X, y = load_breast_cancer(return_X_y=True)

# define different types of penalty strategies
# let's make a toy example and pretend we would be interested in
# running different penalty strategies (I use three times 'l2' here,
# but imagine these would be different)
penalty_types = ['l2', 'l2', 'l2']

# define output list where we add the results using different penalty strategies
nested_cv_scores_list = []

for penalty_type in penalty_types:
    # create a random number generator
    rng = np.random.RandomState(42)
    # z-standardize features
    scaler = StandardScaler()
    # use linear L2-regularized Logistic Regression as classifier
    lr = LogisticRegression(random_state=rng, penalty=penalty_type)
    # define parameter grid to optimize over (optimize C)
    lr_c = np.linspace(start=1, stop=16, num=11, endpoint=True)
    p_grid = {'lr__C': lr_c}
    # create pipeline
    lr_pipe = Pipeline([
        ('scaler', scaler),
        ('lr', lr)
    ])
    # define cross validation strategy
    cv = KFold(shuffle=True, random_state=rng)
    # implement GridSearch (inner cross validation)
    grid = GridSearchCV(lr_pipe, param_grid=p_grid, cv=cv)
    # implement cross_validate (outer cross validation)
    nested_cv_scores = cross_validate(grid, X, y, cv=cv, n_jobs=5)
    # append result to list
    nested_cv_scores_list.append(nested_cv_scores)
Is there a way to parallelize this for-loop?
joblib.Parallel is made for this job! Just put your loop body in a function and call it using Parallel and delayed:
from joblib import Parallel, delayed
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, GridSearchCV, cross_validate

# load breast cancer data set
X, y = load_breast_cancer(return_X_y=True)

# define different types of penalty strategies
# let's make a toy example and pretend we would be interested in
# running different penalty strategies (I use three times 'l2' here,
# but imagine these would be different)
penalty_types = ['l2', 'l2', 'l2']

# define output list where we add the results using different penalty strategies
nested_cv_scores_list = []

# put the rng seed outside of the loop so that not all results are the same
rng = np.random.RandomState(42)

def run_as_job(penalty_type, X, y):
    # z-standardize features
    scaler = StandardScaler()
    # use linear L2-regularized Logistic Regression as classifier
    lr = LogisticRegression(random_state=rng, penalty=penalty_type)
    # define parameter grid to optimize over (optimize C)
    lr_c = np.linspace(start=1, stop=16, num=11, endpoint=True)
    p_grid = {'lr__C': lr_c}
    # ... additional calculation that is missing in the example,
    # e.g. res = cross_val_score(clf, X, y, n_jobs=2)
    return res

if __name__ == '__main__':
    results = Parallel(n_jobs=2)(delayed(run_as_job)(penalty_type, X, y)
                                 for penalty_type in penalty_types)
For more usage options, have a look at the joblib documentation: Embarrassingly parallel for loops.
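
Applied to the 15-CPU scenario from the question, one possible split (an assumption, not a benchmarked recipe) is three outer workers with five inner cross-validation jobs each; note that, depending on the joblib backend, nested parallel calls may be throttled to avoid oversubscription:

# sketch: 3 outer workers x 5 inner jobs = 15 CPUs in the ideal case;
# inside run_as_job, keep cross_validate(grid, X, y, cv=cv, n_jobs=5)
results = Parallel(n_jobs=3)(delayed(run_as_job)(penalty_type, X, y)
                             for penalty_type in penalty_types)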

ROC curve for multi-class classification without one vs all in python

I have a multi-class classification problem with 9 different classes. I am using the AdaBoostClassifier class from scikit-learn to train my model without using the one-vs-all technique, as the number of classes is very high and using it might be inefficient.
I have tried using the tips from the scikit-learn documentation [1], but there the one-vs-all technique is used, which is substantially different. In my approach I only get one prediction per event, i.e. if I have n classes, the outcome of the prediction is a single value within the n classes. For the one-vs-all approach, on the other hand, the outcome of the prediction is an array of size n with a sort of likelihood value per class.
[1]
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
The code is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt  # Matplotlib plotting library for basic visualisation
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
from sklearn import preprocessing

# Read data
df = pd.read_pickle('data.pkl')

# Create the dependent variable class
# This will substitute each of the n classes from text to number
factor = pd.factorize(df['target_var'])
df.target_var = factor[0]
definitions = factor[1]

X = df.drop('target_var', axis=1)
y = df['target_var']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bdt_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=2),
    n_estimators=250,
    learning_rate=0.3)
bdt_clf.fit(X_train, y_train)
y_pred = bdt_clf.predict(X_test)

# Reverse factorize (converting y_pred from 0s, 1s, 2s, etc. to their original values)
reversefactor = dict(zip(range(9), definitions))
y_test_rev = np.vectorize(reversefactor.get)(y_test)
y_pred_rev = np.vectorize(reversefactor.get)(y_pred)
I tried calling the roc_curve function directly, and also after binarising the labels, but I always get the same error message.
def multiclass_roc_auc(y_test, y_pred):
    lb = preprocessing.LabelBinarizer()
    lb.fit(y_test)
    y_test = lb.transform(y_test)
    y_pred = lb.transform(y_pred)
    return roc_curve(y_test, y_pred)

multiclass_roc_auc(y_test, y_pred)
The error message is:
ValueError: multilabel-indicator format is not supported
How could this be sorted out? Am I missing some important concept?
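A hedged sketch of one way around this (an assumption, not the poster's own solution): roc_curve only scores a single binary problem at a time, so after binarising you can loop over the classes; using predict_proba scores instead of hard predictions also gives more meaningful curves. This is still one-vs-rest at evaluation time, but it does not require training n separate models:

from sklearn.preprocessing import label_binarize

# assumes bdt_clf, X_test and y_test from the code above
classes = bdt_clf.classes_
y_test_bin = label_binarize(y_test, classes=classes)
y_score = bdt_clf.predict_proba(X_test)  # one likelihood column per class

roc_per_class = {}
for i, cls in enumerate(classes):
    # one binary ROC curve per class: class i vs. the rest
    fpr, tpr, _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_per_class[cls] = (fpr, tpr, auc(fpr, tpr))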

Why is Multi Class Machine Learning Model Giving Bad Results?

I have the following code so far:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn import preprocessing
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
df_train = pd.read_csv('uc_data_train.csv')
del df_train['Unnamed: 0']

# move the target column 'size_womenswear' to the end of the dataframe
temp = df_train['size_womenswear']
del df_train['size_womenswear']
df_train['size_womenswear'] = temp

df_train['count'] = 1
print(df_train.head())
print(df_train.dtypes)
# determine the number of unique categories and the number of cases per category
print(df_train[['size_womenswear', 'count']].groupby('size_womenswear').count())
del df_train['count']

df_test = pd.read_csv('uc_data_test.csv')
del df_test['Unnamed: 0']
print(df_test.head())
print(df_test.dtypes)

df_train.drop(['customer_id', 'socioeconomic_status', 'brand', 'socioeconomic_desc',
               'order_method', 'first_order_channel', 'days_since_first_order',
               'total_number_of_orders', 'return_rate'], axis=1, inplace=True)

LE = preprocessing.LabelEncoder()  # Create label encoder
df_train['size_womenswear'] = LE.fit_transform(np.ravel(df_train[['size_womenswear']]))
print(df_train.head())
print(df_train.dtypes)

x = df_train.iloc[:, np.arange(len(df_train.columns)-1)].values  # Assign independent values
y = df_train.iloc[:, -1].values                                  # and dependent values
xTrain, xTest, yTrain, yTest = train_test_split(x, y, test_size=0.25, random_state=0)  # Hold out 25% of the data for testing

model = GaussianNB()
model.fit(xTrain, yTrain)
yPredicted = model.predict(xTest)
# print(yPredicted)
print('Accuracy: ', accuracy_score(yTest, yPredicted))
I am not sure how best to include the data I am using, but I am trying to predict 'size_womenswear'. There are 8 different sizes, which I have encoded, and I have moved this column to the end of the dataframe, so y is the dependent variable and x holds the independent variables (all the other columns).
I am using a Gaussian Naive Bayes classifier to classify the 8 different sizes and then test on 25% of the data, but the results are not very good.
I don't know why I only get an accuracy of 61% when I am working with 80,000 rows. I am very new to machine learning and would appreciate any assistance. Is there a better method I could use in this case than Gaussian Naive Bayes?
Can't comment, so just throwing out some ideas:
Maybe you need to deal with class imbalance, and try another model that fits the data better? Try the xgboost or lightgbm packages; given good data they usually perform pretty well in general, but it really depends on the data.
Also, given the way you split train and test, do the resulting train and test sets have a similar distribution for your y? That is very important.
Last thing: for classification models the performance measurement can be a bit tricky, so try some other measures. Look at F1 scores, or draw a confusion matrix and see what your predictions vs. y look like; perhaps your model is predicting everything as one or just a few classes.
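
For instance, a minimal sketch of those two checks, reusing the variables from the question:

from sklearn.metrics import classification_report, confusion_matrix

# per-class precision/recall/F1 instead of a single accuracy number
print(classification_report(yTest, yPredicted))
# rows = true classes, columns = predicted classes; one dominant column
# would mean the model collapses onto a single size
print(confusion_matrix(yTest, yPredicted))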

Python Dataframe: Different RMSE Score Every Time I Run Random Forest Regressor

I currently run a random forest model using the following code. I set a random_state equal to 100.
from sklearn.model_selection import train_test_split  # cross_validation was removed in newer scikit-learn

X_train_RIA_INST_PWM, X_test_RIA_INST_PWM, y_train_RIA_INST_PWM, y_test_RIA_INST_PWM = train_test_split(X_RIA_INST_PWM, Y_RIA_INST_PWM, test_size=0.3, random_state=100)

# Random Forest Regressor for RIA_INST_PWM accounts
import numpy as np
from sklearn.ensemble import RandomForestRegressor

regressor_RIA_INST_PWM = RandomForestRegressor(n_estimators=100, min_samples_split=10)
# note: the model is fit on the full data set here, not only on the training split
regressor_RIA_INST_PWM.fit(X_RIA_INST_PWM, Y_RIA_INST_PWM)

print("R^2 for training set:", regressor_RIA_INST_PWM.score(X_train_RIA_INST_PWM, y_train_RIA_INST_PWM))
print('-' * 50)
print("R^2 for test set:", regressor_RIA_INST_PWM.score(X_test_RIA_INST_PWM, y_test_RIA_INST_PWM))
And then I use the following code to calculate the prediction values.
def predict_AUM(df, features, regressor):
    # Reset index for later merge of predicted target values with Account IDs
    df = df.reset_index()
    # Set predictor variables
    X_Predict = df[features]
    # Clean inputs
    X_Predict = X_Predict.replace([np.inf, -np.inf], np.nan)
    X_Predict = X_Predict.fillna(0)
    # Predict Current_AUM
    Y_AUM_Snapshot_1yr_Predict = regressor.predict(X_Predict)
    df['PREDICTED_SPAN'] = Y_AUM_Snapshot_1yr_Predict
    return df
df_EVENT5_20 = predict_AUM(df_EVENT5_19, dfzip_features_AUM_RIA_INST_PWM, regressor_RIA_INST_PWM)
Finally, I calculate the RMSE of my results:
from sklearn.metrics import mean_squared_error
from math import sqrt
rmse = sqrt(mean_squared_error(df_EVENT5_20['SPAN_DAYS'], df_EVENT5_20['PREDICTED_SPAN']))
rmse
Each time I run my code, my RMSE changes; it has varied from 7.75 to 16.4. Why is this happening, and how can I get the same RMSE each time I run the code? Additionally, how do I optimize my model for RMSE?
You only seeded the train_test_split, which makes sure that the random allocation of the data to the train and test sets is reproducible.
As the name suggests, RandomForestRegressor also contains parts of the algorithm that rely on random numbers (e.g., selecting different subsets of the data or different features for training the individual decision trees). If you want reproducible results, you need to seed it as well by initializing it with random_state, like this:
regressor_RIA_INST_PWM = RandomForestRegressor(
    n_estimators=100,
    min_samples_split=10,
    random_state=100
)
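
With both the split and the forest seeded, repeated runs reproduce the score exactly; a quick sanity check (a sketch reusing the question's variables):

# fitting twice with a fixed random_state must give identical predictions
regressor_RIA_INST_PWM.fit(X_train_RIA_INST_PWM, y_train_RIA_INST_PWM)
pred_1 = regressor_RIA_INST_PWM.predict(X_test_RIA_INST_PWM)
regressor_RIA_INST_PWM.fit(X_train_RIA_INST_PWM, y_train_RIA_INST_PWM)
pred_2 = regressor_RIA_INST_PWM.predict(X_test_RIA_INST_PWM)
assert np.array_equal(pred_1, pred_2)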
