Why different random_states in ML model? - python

I recently read that specifying a number for random_state ensures you get the same results in each run.
Why, then, do I use random_state=1 when splitting the data into training and validation sets but random_state=0 when creating the model?
I would have expected them both to use the same value.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
data = pd.read_csv('../input/fifa-2018-match-statistics/FIFA 2018 Statistics.csv')
y = (data['Man of the Match'] == "Yes") # Convert from string "Yes"/"No" to binary
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
X = data[feature_names]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
my_model = RandomForestClassifier(n_estimators=100,
                                  random_state=0).fit(train_X, train_y)

Don't read too much into the number itself. The random_state parameter seeds NumPy's random number generator (numpy.random.seed()) and ensures that the random numbers you draw are the same from run to run. Initializing with 1 will give you a different sequence than initializing with 0. The split uses random numbers for a different purpose (splitting the data) than the random forest does (introducing randomness into the trees, e.g. when selecting sub-features for a tree), so there is no reason for the two values to match. The number you give it matters little - it only ensures the reproducibility of your draws. To see this, set the seed, e.g. numpy.random.seed(42), and then draw several random numbers with numpy.random.rand(). Resetting the seed to 42 and repeating the draws will give you the same sequence.
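A minimal sketch of that experiment (plain NumPy, independent of the model code above):
import numpy as np

np.random.seed(42)                 # seed the global NumPy RNG
first_draw = np.random.rand(3)

np.random.seed(42)                 # reset with the same seed
second_draw = np.random.rand(3)
print(np.array_equal(first_draw, second_draw))   # True: same seed, same sequence

np.random.seed(0)                  # a different seed
third_draw = np.random.rand(3)
print(np.array_equal(first_draw, third_draw))    # False: different sequence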
From time to time (after setting everything up satisfactorily) it can be wise to drop the fixed random_state and see how your results look across repeated runs with more randomness included. Trying other values (or no seed at all) gives you a sense of how robust and valid your results really are. If you need to compare and reproduce results exactly, the seed should be fixed.

Related

Random Forest Regressor Feature Importance all zero

I'm running a random forest regressor using scikit-learn, but all the predictions end up being the same.
I realized that when I fit the data, all the feature importances are zero, which is probably why all the predictions are the same.
This is the code that I'm using:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
merged_df = pd.read_csv("/home/jovyan/efs/vliu/combined_data.csv")
target = merged_df["400kmDensity"]
merged_df.drop("400kmDensity", axis = 1, inplace = True)
features_list = list(merged_df.columns)
#Set training and testing groups
train_features, test_features, train_target, test_target = train_test_split(merged_df, target, random_state = 16)
#Train model
rf = RandomForestRegressor(n_estimators = 150, random_state = 16)
ran = rf.fit(train_features, train_target)
print("Feature importances: ", rf.feature_importances_)
#Make predictions and calculate error
predictions = ran.predict(test_features)
print("Predictions: ", predictions)
Here's a link to the data file:
https://drive.google.com/file/d/1ECgKAH82wxIvt2OCv4W5ir1te_Vr3N_r/view?usp=sharing
If anybody can see what I did wrong before fitting the data that would result in the feature importances all being zero, that would be much appreciated.
Both your variables "400kmDensity" and "410kmDensity" share a correlation coefficient of >0.99:
np.corrcoef(merged_df["400kmDensity"], merged_df["410kmDensity"])
This practically means that you can predict "400kmDensity" almost exclusively from "410kmDensity"; on a scatter plot the two form an almost perfect line.
In order to actually explore what affects the values of "400kmDensity", you should exclude "410kmDensity" as a regressor (an explanatory variable). The feature importances can then help to identify the remaining explanatory variables.
Note that feature importance may not be a perfect metric for determining actual feature relevance. You may also want to look at other available methods such as the Boruta algorithm or permutation importance.
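A hedged sketch of both suggestions, i.e. dropping the near-duplicate column and checking permutation importance on the held-out data (column names follow the question; sklearn.inspection.permutation_importance requires a reasonably recent scikit-learn):
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# drop the almost perfectly correlated regressor before fitting
features = merged_df.drop(columns=["410kmDensity"])
train_features, test_features, train_target, test_target = train_test_split(
    features, target, random_state=16)

rf = RandomForestRegressor(n_estimators=150, random_state=16)
rf.fit(train_features, train_target)

# permutation importance on the test set is usually more informative
# than the impurity-based feature_importances_
result = permutation_importance(rf, test_features, test_target,
                                n_repeats=10, random_state=16)
for name, score in sorted(zip(features.columns, result.importances_mean),
                          key=lambda t: t[1], reverse=True):
    print(name, round(score, 4))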
In regard to the initial question: I'm not really sure why, but RandomForestRegressor seems to have a problem with the very small values of your target variable(?). I was able to get feature importances after I scaled train_target and train_features in rf.fit(). However, this should not actually be necessary in order to apply a random forest! You may want to look into the respective documentation or extend your search in this direction. Hope this serves as a hint.
from sklearn.preprocessing import scale
fitted_rf = rf.fit(scale(train_features), scale(train_target))
As mentioned before, after this change the feature importances are, unsurprisingly, no longer all zero.
Also, the column "second" holds only the value zero, so it cannot explain anything! Your first step should always be EDA (Exploratory Data Analysis) to get a feeling for the data, e.g. checking correlations between columns or generating histograms to explore the data distributions.
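For instance, a minimal EDA sketch along those lines (assuming merged_df is the pandas DataFrame from the question, ideally before the target is dropped):
import matplotlib.pyplot as plt

# pairwise correlations between numeric columns
print(merged_df.corr())

# histograms of each column; a constant column like "second" shows up as a single bar
merged_df.hist(bins=30, figsize=(12, 8))
plt.tight_layout()
plt.show()

# columns with a single unique value carry no information and can be dropped
constant_cols = [c for c in merged_df.columns if merged_df[c].nunique() <= 1]
print("Constant columns:", constant_cols)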
There is much more to it, but I hope this gives you a leg-up!

How is it that the accuracy score for 10-fold cross-validation is worse than for a 90-10 train_test_split using sklearn?

The task is binary classification via a neural network. The data is stored in a dictionary that contains the composite name of each entry (as the key) and the label (0 or 1, as the third element of the value vector). The first and second elements of the value are the two parts of the composite name, which are used later to extract the corresponding features.
In both cases, the dictionary is transformed into two arrays for the purpose of performing a balanced undersampling of the majority class (which makes up 66% of the data):
import numpy as np
from imblearn.under_sampling import RandomUnderSampler

data_for_sampling = np.asarray([key for key in list(data.keys())])
labels_for_sampling = [element[2] for element in list(data.values())]
sampler = RandomUnderSampler(sampling_strategy = 'majority')
data_sampled, label_sampled = sampler.fit_resample(data_for_sampling.reshape(-1, 1), labels_for_sampling)
Then the resampled arrays of names and labels are used to create train and test sets, either via the KFold method:
from sklearn.model_selection import KFold, train_test_split

kfolder = KFold(n_splits = 10, shuffle = True)
kfolder.get_n_splits(data_sampled)
for train_index, test_index in kfolder.split(data_sampled):
    data_train, data_test = data_sampled[train_index], data_sampled[test_index]
Or the train_test_split method:
data_train, data_test, label_train, label_test = train_test_split(data_sampled, label_sampled, test_size = 0.1, shuffle = True)
Finally, the names from data_train and data_test are used to re-extract the relevant entries (by key) from the original dictionary, which is then used to gather the features of those entries. As far as I understand, a single split of the 10-fold setup should provide a similar train-test data distribution to the 90-10 train_test_split, and that seems to be true during training, where both training sets result in ~0.82 accuracy after only one epoch, run separately with model.fit(). However, when the corresponding models are evaluated using model.evaluate() on the test sets after said epoch, the set from train_test_split gives a score of ~0.86, while the set from KFold gives ~0.72. I have done numerous tests to see if it is just a bad random seed (which is not fixed anywhere), but the results stayed the same. The sets also have correctly balanced label distributions and seemingly well-shuffled entries.
As it turns out, the problem originates from a combination of sources:
While shuffle = True in the train_test_split() method properly shuffles the provided data first and then splits it into the desired parts, shuffle = True in the KFold method only randomizes how samples are assigned to the folds; the data within each fold remains in its original order.
This is something the documentation points out, thanks to this post:
https://github.com/scikit-learn/scikit-learn/issues/16068
Now, during learning, my custom train function applies shuffle again on the train data, just to be sure, but it does not shuffle the test data. Moreover, model.evaluate() defaults to batch_size = 32 if no parameter is given, which, paired with the ordered test data, resulted in the discrepancy in the validation accuracy. The test data is indeed flawed in the sense that it contains a large portion of "hard-to-predict" entries, which were clustered together thanks to the ordering and seem to have dragged down the average accuracy. A completed run across all N folds, as pointed out by TC Arlen, may indeed have given a more precise estimate in the end, but I expected closer results after only one fold, which led to the discovery of this problem.
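A small sketch illustrating the difference on a toy array (behavior as described in the linked issue: KFold assigns samples to folds at random but yields the indices of each fold in ascending order, whereas train_test_split returns the samples themselves in shuffled order):
import numpy as np
from sklearn.model_selection import KFold, train_test_split

X = np.arange(20)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    print("fold test indices:", test_idx)            # random membership, but sorted order

X_train, X_test = train_test_split(X, test_size=0.25, shuffle=True, random_state=0)
print("train_test_split test samples:", X_test)      # shuffled order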
Depending on the amount of noise in the data and on the size of the dataset, it can be expected behavior for scores on out-of-sample data to deviate by this amount. One split is not guaranteed to be just like any other split, which is why you have 10 of them in the first place and then average across all results.
What you should trust to be the most generalizable is not any one given split (whether that comes from one of the 10 folds or train_test_split()), but what is far more trustworthy is the average result across all N folds.
Digging deeper into the data could reveal whether there is some reason why one or more splits deviate so much from the others. For example, perhaps there is some feature in your data (e.g. "date the sample was collected", where the collection methodology changed from month to month) that makes the splits differ from one another in a biased way. If that is the case, you should use a stratified test split (in your CV as well; see the scikit-learn documentation on that) so you can get a more unbiased grouping of your data.
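A hedged sketch of stratified splitting for both cases, here stratifying on the labels (variable names follow the question; the same idea applies to any grouping variable):
from sklearn.model_selection import StratifiedKFold, train_test_split

# stratified hold-out split: train and test keep the same label proportions
data_train, data_test, label_train, label_test = train_test_split(
    data_sampled, label_sampled, test_size=0.1, shuffle=True, stratify=label_sampled)

# stratified 10-fold CV: every fold keeps roughly the same label distribution
skf = StratifiedKFold(n_splits=10, shuffle=True)
for train_index, test_index in skf.split(data_sampled, label_sampled):
    data_train, data_test = data_sampled[train_index], data_sampled[test_index]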

Machine learning algorithm score changes without any change in data or step

I am new to machine learning and getting started with the Titanic problem on Kaggle. I have written a simple algorithm to predict the result on the test data.
My question/confusion is that every time I execute the algorithm with the same dataset and the same steps, the score value (last statement in the code) changes. I am not able to understand this behaviour.
Code:
# imports
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
# load data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
results = pd.read_csv('gender_submission-orig.csv')
# prepare training and test dataset
y = train['Survived']
X = train.drop(['Survived', 'SibSp', 'Ticket', 'Cabin', 'Embarked', 'Name'], axis=1)
test = test.drop(['SibSp', 'Ticket', 'Cabin', 'Embarked', 'Name'], axis=1)
y_test = results['Survived']
X = pd.get_dummies(X)
test = pd.get_dummies(test)
# fill the missing values
age_median = X['Age'].median()
fare_median = X['Fare'].median()
X['Age'] = X['Age'].fillna(age_median)
test['Age'].fillna(age_median, inplace=True)
test['Fare'].fillna(fare_median, inplace=True)
# train the classifier and predict
clf = DecisionTreeClassifier()
clf.fit(X, y)
predict = clf.predict(test)
# This is the score which changes with execution.
print(round(clf.score(test, y_test) * 100, 2))
This is a common frustration that newcomers to the field face. The cause is the inherent randomness in this kind of algorithm, and the simple & straightforward remedy, as has already been suggested in the comments, is to explicitly set the state (seed) of the random number generator, e.g.:
clf = DecisionTreeClassifier(random_state=42)
But with different values, the score also changes. So how do we find the optimal or right value?
Again, this is expected and cannot be overcome: this kind of randomness is fundamental & irreducible, beyond which you simply cannot go. Setting the random seed as suggested above just ensures the reproducibility of a specific model/script; finding an "optimal" value in the sense you mean it here (i.e. regarding the random parts) is not possible. Statistically speaking, the results produced by different values of the random seed should be similar (in the statistical sense), but quantifying this similarity exactly is an exercise in rigorous statistics that goes well beyond the scope of this post.
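If you want a rough feel for that variability, a minimal sketch (reusing X, y, test and y_test from the question's code) is to repeat the fit over several seeds and look at the spread of the scores:
import numpy as np
from sklearn.tree import DecisionTreeClassifier

scores = []
for seed in range(10):
    clf = DecisionTreeClassifier(random_state=seed)
    clf.fit(X, y)
    scores.append(clf.score(test, y_test))

# the mean is the typical performance, the standard deviation shows how much the seed matters
print("accuracy: %.3f +/- %.3f" % (np.mean(scores), np.std(scores)))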
Randomness is often a non-intuitive realm, and random number generators (RNGs) themselves are strange animals... As a general note, you might be interested to know that RNGs are not even "compatible" across different languages & frameworks.

Isolation Forest Sklearn for 1D array or list and how to tune hyper parameters

Is there a way to use sklearn's Isolation Forest on a 1D array or list? All the examples I came across are for data with two or more dimensions.
I have developed a model with three features, and the example code snippet is shown below:
from sklearn.ensemble import IsolationForest

# dataframe of three columns
df_data = datafr[['col_A', 'col_B', 'col_C']]
w_train = df_data[:700]
w_test = df_data[700:-2]

# fit the model
clf = IsolationForest(max_samples='auto')
clf.fit(w_train)

# test it using the test set
y_pred_test = clf.predict(w_test)
The reference I mainly relied upon: IsolationForest example | scikit-learn
df_data is a data frame with three columns. I am actually looking to find outliers in 1-dimensional or list data.
The other question is how to tune an isolation forest model. One way is to increase the contamination value to reduce false positives, but how should the other parameters like n_estimators, max_samples, max_features, verbose, etc. be used?
It won't make sense to apply an Isolation Forest to a 1D array or list, because in that case it would simply be a one-to-one mapping from feature to target.
You can read the official documentation to get a better idea of how the different parameters help:
contamination
The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.
Try experimenting with different values in the range [0, 0.5] to see which one gives the best results.
max_features
The number of features to draw from X to train each base estimator.
Try values like 5, 6, 10, etc., any int of your choice, and validate against the final test data.
n_estimators
The number of base estimators in the ensemble. Try multiple values like 10, 20, 50, etc. to see which works best.
You can also use GridSearchCV to automate this process of parameter estimation: just try different value combinations and see which one gives the best results.
Try this
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score, make_scorer

my_scoring_func = make_scorer(f1_score)
parameters = {'n_estimators': [10, 30, 50, 80],
              'max_features': [0.1, 0.2, 0.3, 0.4],
              'contamination': [0.1, 0.2, 0.3]}
iso_for = IsolationForest(max_samples='auto')
clf = GridSearchCV(iso_for, parameters, scoring=my_scoring_func)
Then use clf to fit the data. Note that GridSearchCV requires both X and y (i.e. training data and labels) for its fit method.
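A hedged usage sketch under that caveat (w_train and w_test as in the question; y_train is a hypothetical array of ground-truth labels encoded as 1 for inliers and -1 for outliers, matching what IsolationForest.predict returns, so that f1_score can compare them directly):
clf.fit(w_train, y_train)
print(clf.best_params_)            # best parameter combination found by the grid search
best_model = clf.best_estimator_
y_pred_test = best_model.predict(w_test)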
Note: You can read this blog post for further reference if you wish to use GridSearchCV with Isolation Forest; otherwise you can manually try different values and plot graphs to see the results.

Python SGDRegressor, different results of RMSLE

I'm trying to use SGDRegressor from scikit-learn for a simple linear regression problem, but my code gives a different value of RMSLE each time. Why is this so? Also, how do I obtain the lowest possible RMSLE?
import numpy as np
import pandas as pd
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from math import sqrt
import math
import matplotlib.pyplot as plt
#load data
train = pd.read_csv('Train.csv')
test = pd.read_csv('Test.csv')
#split data
x_train = train.GrLivArea[:1000].values.reshape(-1,1)
y_train = train.SalePrice[:1000].values.reshape(-1,1)
x_train_normal = np.log(x_train)
y_train_normal = np.log(y_train) #Normalization
x_test = train.GrLivArea[1000:].values.reshape(-1,1)
y_test = train.SalePrice[1000:].values.reshape(-1,1)
x_test_normal = np.log(x_test)
y_test_normal = np.log(y_test) # Normalization
y_test_transform = np.exp(y_test_normal)
Model = linear_model.SGDRegressor()
Model.n_iter = np.ceil(10**7 / len(y_train_normal))
Model.fit(x_train_normal,y_train_normal)
Sale_Prices_Predicted = Model.predict(x_test_normal)
Sale_Prices_Prediceted_Transform = np.exp(Sale_Prices_Predicted)
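# rmsle() is called below but its definition was not shown in the question;
# a standard (assumed) implementation would be:
def rmsle(y_true, y_pred):
    return sqrt(mean_squared_error(np.log1p(y_true), np.log1p(y_pred)))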
rmslee = rmsle(y_test_transform, Sale_Prices_Prediceted_Transform)
print("RMSLE: ", rmslee)
For example:
0.28153047299638045
0.28190513681658363
0.28207666380009233
0.28126007334118047
It is quite simple: the SGDRegressor is not initialized the same way each time. If you want reproducible results, you need to fix the seed.
Different random initializations lead to slightly different results.
Very common situation in Machine Learning.
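For the snippet above, fixing the seed is a one-line change (42 is an arbitrary choice):
Model = linear_model.SGDRegressor(random_state=42)   # any fixed integer makes the runs reproducible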
This behavior can appear for any kind of model which is randomly initialized:
Neural Networks
Random Forests
SVM
etc.
Documentation SGDRegressor: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html
random_state : int, RandomState instance or None, optional (default=None)
The seed of the pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
