Classification results depend on random_state?

Classification results depend on random_state? - python

I want to implement a AdaBoost model using scikit-learn (sklearn). My question is similar to another question but it is not totally the same. As far as I understand, the random_state variable described in the documentation is for randomly splitting the training and testing sets, according to the previous link. So if I understand correctly, my classification results should not be dependent on the seeds, is it correct? Should I be worried if my classification results turn out to be dependent on the random_state variable?

Your classification scores will depend on random_state. As #Ujjwal rightly said, it is used for splitting the data into training and test test. Not just that, a lot of algorithms in scikit-learn use the random_state to select the subset of features, subsets of samples, and determine the initial weights etc.
For eg.
Tree based estimators will use the random_state for random selections of features and samples (like DecisionTreeClassifier, RandomForestClassifier).
In clustering estimators like Kmeans, random_state is used to initialize centers of clusters.
SVMs use it for initial probability estimation
Some feature selection algorithms also use it for initial selection
And many more...
Its mentioned in the documentation that:
If your code relies on a random number generator, it should never use functions like numpy.random.random or numpy.random.normal. This approach can lead to repeatability issues in tests. Instead, a numpy.random.RandomState object should be used, which is built from a random_state argument passed to the class or function.
Do read the following questions and answers for better understanding:
Choosing random_state for sklearn algorithms
confused about random_state in decision tree of scikit learn

It does matter. When your training set differs then your trained state also changes. For a different subset of data you can end up with a classifier which is little different from the one trained with some other subset.
Hence, you should use a constant seed like 0 or another integer, so that your results are reproducible.

Related

Get individual models and customized score in GridSearchCV and RandomizedSearchCV [duplicate]

This question already has an answer here:
Retrieving specific classifiers and data from GridSearchCV
(1 answer)
Closed 2 years ago.
GridSearchCV and RandomizedSearchCV has best_estimator_ that :
Returns only the best estimator/model
Find the best estimator via one of the simple scoring methods : accuracy, recall, precision, etc.
Evaluate based on training sets only
I would like to enrich those limitations with
My own definition of scoring methods
Evaluate further on test set rather than training as done by GridSearchCV. Eventually it's the test set performance that counts. Training set tends to give almost perfect accuracy on my Grid Search.
I was thinking of achieving it by :
Get the individual estimators/models in GridSearchCV and RandomizedSearchCV
With every estimator/model, predict on test set and evaluate with my customized score
My question is:
Is there a way to get all individual models from GridSearchCV ?
If not, what is your thought to achieve the same thing as what I wanted ? Initially I wanted to exploit existing GridSearchCV because it handles automatically multiple parameter grid, CV and multi-threading. Any other recommendation to achieve the similar result is welcome.

You can use custom scoring methods already in the XYZSearchCVs: see the scoring parameter and the documentation's links to the User Guide for how to write a custom scorer.
You can use a fixed train/validation split to evaluate the hyperparameters (see the cv parameter), but this will be less robust than a k-fold cross-validation. The test set should be reserved for scoring only the final model; if you use it to select hyperparameters, then the scores you receive will not be unbiased estimates of future performance.
There is no easy way to retrieve all the models built by GridSearchCV. (It would generally be a lot of models, and saving them all would generally be a waste of memory.)
The parallelization and parameter grid parts of GridSearchCV are surprisingly simple; if you need to, you can copy out the relevant parts of the source code to produce your own approach.
Training set tends to give almost perfect accuracy on my Grid Search.
That's a bit surprising, since the CV part of the searches means the models are being scored on unseen data. If you get very high best_score_ but low performance on the test set, then I would suspect your training set is not actually a representative sample, and that'll require a much more nuanced understanding of the situation.

xgboost: Sample Weights for Imbalanced Data?

I have a highly unbalanced dataset of 3 classes. To address this, I applied the sample_weight array in the XGBClassifier, but I'm not noticing any changes in the modelling results? All of the metrics in the classification report (confusion matrix) are the same. Is there an issue with the implementation?
The class ratios:
military: 1171
government: 34852
other: 20869
Example:
pipeline = Pipeline([
('bow', CountVectorizer(analyzer=process_text)), # convert strings to integer counts
('tfidf', TfidfTransformer()), # convert integer counts to weighted TF-IDF scores
('classifier', XGBClassifier(sample_weight=compute_sample_weight(class_weight='balanced', y=y_train))) # train on TF-IDF vectors w/ Naive Bayes classifier
])
Sample of Dataset:
data = pd.DataFrame({'entity_name': ['UNICEF', 'US Military', 'Ryan Miller'],
'class': ['government', 'military', 'other']})
Classification Report

First, most important: use a multiclass eval_metric. eval_metric=merror or mlogloss, then post us the results. You showed us ['precision','recall','f1-score','support'], but that's suboptimal, or outright broken unless you computed them in a multi-class-aware, imbalanced-aware way.
Second, you need weights. Your class ratio is military: government: other 1:30:18, or as percentages 2:61:37%.
You can manually set per-class weights with xgb.DMatrix..., weights)
Look inside your pipeline (use print or verbose settings, dump values), don't just blindly rely on boilerplate like sklearn.utils.class_weight.compute_sample_weight('balanced', ...) to give you optimal weights.
Experiment with manually setting per-class weights, starting with 1 : 1/30 : 1/18 and try more extreme values. Reciprocals so the rarer class gets higher weight.
Also try setting min_child_weight much higher, so it requires a few exemplars (of the minority classes). Start with min_child_weight >= 2(* weight of rarest class) and try going higher. Beware of overfitting to the very rare minority class (this is why people use StratifiedKFold crossvalidation, for some protection, but your code isn't using CV).
We can't see your other parameters for xgboost classifier (how many estimators? early stopping on or off? what was learning_rate/eta? etc etc.). Seems like you used the defaults - they'll be terrible. Or else you're not showing your code. Distrust xgboost's defaults, esp. for multiclass, don't expect xgboost to give good out-of-the-box results. Read the doc and experiment with values.
Do all that experimentation, post your results, check before concluding "it doesn't work". Don't expect optimal results from out-of-the-box. Distrust or double-check the sklearn util functions, try manual alternatives. (Often, just because sklearn has a function to do something, doesn't mean it's good or best or suitable for all use-cases, like imbalanced multiclass)

Does scikit-learn use One-Vs-Rest by default in multi-class classification?

I am dealing with a multi-class problem (4 classes) and I am trying to solve it with scikit-learn in Python.
I saw that I have three options:
I simply instantiate a classifier, then I fit with train and evaluate with test;
classifier = sv.LinearSVC(random_state=123)
classifier.fit(Xtrain, ytrain)
classifier.score(Xtest, ytest)
I "encapsulate" the instantiated classifier in a OneVsRest object, generating a new classifier that I use for train and test;
classifier = OneVsRestClassifier(svm.LinearSVC(random_state=123))
classifier.fit(Xtrain, ytrain)
classifier.score(Xtest, ytest)
I "encapsulate" the instantiated classifier in a OneVsOne object, generating a new classifier that I use for train and test.
classifier = OneVsOneClassifier(svm.LinearSVC(random_state=123))
classifier.fit(Xtrain, ytrain)
classifier.score(Xtest, ytest)
I understand the difference between OneVsRest and OneVsOne, but I cannot understand what I am doing in the first scenario where I do not explicitly pick up any of these two options. What does scikit-learn do in that case? Does it implicitly use OneVsRest?
Any clarification on the matter would be highly appreciated.
Best,
MR
Edit:
Just to make things clear, I am not specifically interested in the case of SVMs. For example, what about RandomForest?

Updated answer: As clarified in the comments and edits, the question is more about the general setting of sklearn, and less about the specific case of LinearSVC which is explained below.
The main difference here is that some of the classifiers you can use have "built-in multiclass classification support", i.e. it is possible for that algorithm to discern between more than two classes by default. One example for this would for example be a Random Forest, or a Multi-Layer Perceptron (MLP) with multiple output nodes.
In these cases, having a OneVs object is not required at all, since you are already solving your task. In fact, using such a strategie might even decreaes your performance, since you are "hiding" potential correlations from the algorithm, by letting it only decide between single binary instances.
On the other hand, algorithms like SVC or LinearSVC only support binary classification. So, to extend these classes of (well-performing) algorithms, we instead have to rely on the reduction to a binary classification task, from our initial multiclass classification task.
As far as I am aware of, the most complete overview can be found here:
If you scroll down a little bit, you can see which one of the algorithms is inherently multiclass, or uses either one of the strategies by default.
Note that all of the listed algorithms under OVO actually now employ a OVR strategy by default! This seems to be slightly outdated information in that regard.
Initial answer:
This is a question that can easily be answered by looking at the relevant scikit-learn documentation.
Generally, the expectation on Stackoverflow is that you have at least done some form of research on your own, so please consider looking into existing documentation first.
multi_class : string, ‘ovr’ or ‘crammer_singer’ (default=’ovr’)
Determines the multi-class strategy if y contains more than two classes. "ovr" trains n_classes one-vs-rest classifiers, while
"crammer_singer" optimizes a joint objective over all classes. While
crammer_singer is interesting from a theoretical perspective as it is
consistent, it is seldom used in practice as it rarely leads to better
accuracy and is more expensive to compute. If "crammer_singer" is
chosen, the options loss, penalty and dual will be ignored.
So, clearly, it uses one-vs-rest.
The same holds by the way for the "regular" SVC.

Linear regression: Good results for training data, horrible for test data

I am working with a dataset of about 400.000 x 250.
I have a problem with the model yielding a very good R^2 score when testing it on the training set, but extremely poorly when used on the test set. Initially, this sounds like overfitting. But the data is split into training/test set at random and the data set i pretty big, so I feel like there has to be something else.
Any suggestions?
Splitting dataset into training set and test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df.drop(['SalePrice'],
axis=1), df.SalePrice, test_size = 0.3)
Sklearn's Linear Regression estimator
from sklearn import linear_model
linReg = linear_model.LinearRegression() # Create linear regression object
linReg.fit(X_train, y_train) # Train the model using the training sets
# Predict from training set
y_train_linreg = linReg.predict(X_train)
# Predict from test set
y_pred_linreg = linReg.predict(X_test)
Metric calculation
from sklearn import metrics
metrics.r2_score(y_train, y_train_linreg)
metrics.r2_score(y_test, y_pred_linreg)
R^2 score when testing on training set: 0,64
R^2 score when testing on testing set: -10^23 (approximatly)

While I agree with Mihai that your problem definitely looks like overfitting, I don't necessarily agree on his answer that neural network would solve your problem; at least, not out of the box. By themselves, neural networks overfit more, not less, than linear models. You need somehow to take care of your data, hardly any model can do that for you. A few options that you might consider (apologies, I cannot be more precise without looking at the dataset):
Easiest thing, use regularization. 400k rows is a lot, but with 250 dimensions you can overfit almost whatever you like. So try replacing LinearRegression by Ridge or Lasso (or Elastic Net or whatever). See http://scikit-learn.org/stable/modules/linear_model.html (Lasso has the advantage of discarding features for you, see next point)
Especially if you want to go outside of linear models (and you probably should), it's advisable to first reduce the dimension of the problem, as I said 250 is a lot. Try using some of the Feature selection techniques here: http://scikit-learn.org/stable/modules/feature_selection.html
Probably most importantly than anything else, you should consider adapting your input data. The very first thing I'd try is, assuming you are really trying to predict a price as your code implies, to replace it by its logarithm, or log(1+x). Otherwise linear regression will try very very hard to fit that single object that was sold for 1 Million $ ignoring everything below $1k. Just as important, check if you have any non-numeric (categorical) columns and keep them only if you need them, in case reducing them to macro-categories: a categorical column with 1000 possible values will increase your problem dimension by 1000, making it an assured overfit. A single column with a unique categorical data for each input (e.g. buyer name) will lead you straight to perfect overfitting.
After all this (cleaning data, reducing dimension via either one of the methods above or just Lasso regression until you get to certainly less than dim 100, possibly less than 20 - and remember that this includes any categorical data!), you should consider non-linear methods to further improve your results - but that's useless until your linear model provides you at least some mildly positive R^2 value on test data. sklearn provides a lot of them: http://scikit-learn.org/stable/modules/kernel_ridge.html is the easiest to use out-of-the-box (also does regularization), but it might be too slow to use in your case (you should first try this, and any of the following, on a subset of your data, say 1000 rows once you've selected only 10 or 20 features and see how slow that is). http://scikit-learn.org/stable/modules/svm.html#regression have many different flavours, but I think all but the linear one would be too slow. Sticking to linear things, http://scikit-learn.org/stable/modules/sgd.html#regression is probably the fastest, and would be how I'd train a linear model on this many samples. Going truly out of linear, the easiest techniques would probably include some kind of trees, either directly http://scikit-learn.org/stable/modules/tree.html#regression (but that's an almost-certain overfit) or, better, using some ensemble technique (random forests http://scikit-learn.org/stable/modules/ensemble.html#forests-of-randomized-trees are the typical go-to algorithm, gradient boosting http://scikit-learn.org/stable/modules/ensemble.html#gradient-tree-boosting sometimes works better). Finally, state-of-the-art results are indeed generally obtained via neural networks, see e.g. http://scikit-learn.org/stable/modules/neural_networks_supervised.html but for these methods sklearn is generally not the right answer and you should take a look at dedicated environments (TensorFlow, Caffe, PyTorch, etc.)... however if you're not familiar with those it is certainly not worth the trouble!

Break up Random forest classification fit into pieces in python?

I have almost 900,000 rows of information that I want to run through scikit-learn's Random Forest Classifier algorithm. Problem is, when I try to create the model my computer freezes completely, so what I want to try is running the model every 50,000 rows but I'm not sure if this is possible.
So the code I have now is
# This code freezes my computer
rfc.fit(X,Y)
#what I want is
model = rfc.fit(X.ix[0:50000],Y.ix[0:50000])
model = rfc.fit(X.ix[0:100000],Y.ix[0:100000])
model = rfc.fit(X.ix[0:150000],Y.ix[0:150000])
#... and so on

Feel free to correct me if I'm wrong, but I assume you're not using the most current version of scikit-learn (0.16.1 as of writing this), that you're on a Windows machine and using n_jobs=-1 (or a combination of all three). So my suggestion would be to first upgrade scikit-learn or set n_jobs=1 and try fitting on the whole dataset.
If that fails, take a look at the warm_start parameter. By setting it to True and gradually incrementing n_estimators you can fit additional trees on subsets of your data:
# First build 100 trees on the first chunk
clf = RandomForestClassifier(n_estimators=100, warm_start=True)
clf.fit(X.ix[0:50000],Y.ix[0:50000])
# add another 100 estimators on chunk 2
clf.set_params(n_estimators=200)
clf.fit(X.ix[0:100000],Y.ix[0:100000])
# and so forth...
clf.set_params(n_estimators=300)
clf.fit(X.ix[0:150000],Y.ix[0:150000])
Another possibility is to fit a new classifier on each chunk and then simply average the predictions from all classifiers or merging the trees into one big random forest like described here.

Another method similar to the one linked in Andreus' answer is to grow the trees in the forest individually.
I did this a while back: basically I trained a number of DecisionTreeClassifier's one at a time on different partitions of the training data. I saved each model via pickling, and afterwards I loaded them into a list which was assigned to the estimators_ attribute of a RandomForestClassifier object. You also have to take care to set the rest of the RandomForestClassifier attributes appropriately.
I ran into memory issues when I built all the trees in a single python script. If you use this method and run into that issue, there's a work-around, I posted in the linked question.

from sklearn.datasets import load_iris
boston = load_iris()
X, y = boston.data, boston.target
### RandomForestClassifier
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier(n_estimators=10, warm_start=True)
rfc.fit(X[:50], y[:50])
print(rfc.score(X, y))
rfc.n_estimators += 10
rfc.fit(X[51:100], y[51:100])
print(rfc.score(X, y))
rfc.n_estimators += 10
rfc.fit(X[101:150], y[101:150])
print(rfc.score(X, y))
Below is differentiation between warm_start and partial_fit.
When fitting an estimator repeatedly on the same dataset, but for multiple parameter values (such as to find the value maximizing performance as in grid search), it may be possible to reuse aspects of the model learnt from the previous parameter value, saving time. When warm_start is true, the existing fitted model attributes an are used to initialise the new model in a subsequent call to fit.
Note that this is only applicable for some models and some parameters, and even some orders of parameter values. For example, warm_start may be used when building random forests to add more trees to the forest (increasing n_estimators) but not to reduce their number.
partial_fit also retains the model between calls, but differs: with warm_start the parameters change and the data is (more-or-less) constant across calls to fit; with partial_fit, the mini-batch of data changes and model parameters stay fixed.
There are cases where you want to use warm_start to fit on different, but closely related data. For example, one may initially fit to a subset of the data, then fine-tune the parameter search on the full dataset. For classification, all data in a sequence of warm_start calls to fit must include samples from each class.

Some algorithms in scikit-learn implement 'partial_fit()' methods, which is what you are looking for. There are random forest algorithms that do this, however, I believe the scikit-learn algorithm is not such an algorithm.
However, this question and answer may have a workaround that would work for you. You can train forests on different subsets, and assemble a really big forest at the end:
Combining random forest models in scikit learn

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.