sklearn preprocessing.scale() function, when to use it? - python

I'm building a neural network using sklearn.neural_network.MLPClassifier:
clf = sklearn.neural_network.MLPClassifier(hidden_layer_sizes=(11, 11, 11), max_iter=500)
Before training it, I'm creating new features from the existing ones using preprocessing.scale(), like so:
labels = someDataBase.loadLabels()
features = someDataBase.loadFeatures()
features = preprocessing.scale(features)
and from them, using the train_test_split function, creating the train and test values, like so:
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)
Then I feed them to the fit method of the MLPClassifier:
clf.fit(X_train, y_train)
Now that I have a trained neural network, I want to use it to predict based on new features, using the predict method of the MLPClassifier. These features are not the test ones; they are totally new values.
Should I apply preprocessing.scale() to them again and then feed them into the predict method, or just use them as they are?

Your method may give different scaling factors each time it is run: it is fine for single scaling jobs, but not for ones requiring a consistent transformation.
I suggest you rather use sklearn.preprocessing.StandardScaler. It is quite well documented and has examples here.
Call its fit_transform method when training, and just transform when predicting.
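For illustration, a minimal sketch of that workflow; features, labels and new_features stand in for your own arrays and are not defined here:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # learn mean and std on the training data only
X_test = scaler.transform(X_test)        # reuse the same statistics

clf = MLPClassifier(hidden_layer_sizes=(11, 11, 11), max_iter=500)
clf.fit(X_train, y_train)

# later, for completely new samples:
predictions = clf.predict(scaler.transform(new_features))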

Related

Permutation importance using a Pipeline in SciKit-Learn

I am using the exact example from the scikit-learn documentation, which compares permutation_importance with the tree's feature_importances_.
As you can see, a Pipeline is used:
rf = Pipeline([
    ('preprocess', preprocessing),
    ('classifier', RandomForestClassifier(random_state=42))
])
rf.fit(X_train, y_train)
permutation_importance:
Now, when you fit a Pipeline, it will Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
Later in the example, they used the permutation_importance on the fitted model:
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=42, n_jobs=2)
Problem: What I don't understand is that the features in the result are still the original non-transformed features. Why is this the case? Is this working correctly? What is the purpose of the Pipeline then?
tree feature_importance:
In the same example, when they use the feature_importance, the results are transformed:
tree_feature_importances = (
    rf.named_steps['classifier'].feature_importances_)
I can obviously transform my features and then use permutation_importance, but it seems that the steps presented in the examples are intentional, and there should be a reason why permutation_importance does not transform the features.
This is the expected behavior. The way permutation importance works is to shuffle the input data and apply it to the pipeline (or to the model, if that is what you want). In fact, if you want to understand how the initial input data affects the model, then you should apply it to the pipeline.
If you are interested in the feature importance of each of the additional features generated by your preprocessing steps, then you should generate the preprocessed dataset with column names and then apply that data to the model directly (using permutation importance) instead of to the pipeline.
In most cases people are not interested in learning the impact of the secondary features that the pipeline generates. That is why they use the pipeline here to encompass the preprocessing and modeling steps.
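As a hedged sketch of that second option, reusing the fitted pipeline from the example above (the step names assume the Pipeline shown there):
from sklearn.inspection import permutation_importance

# Transform the test data with the fitted preprocessing step, then ask for
# permutation importance of the classifier alone.
X_test_transformed = rf.named_steps['preprocess'].transform(X_test)
result_model = permutation_importance(rf.named_steps['classifier'],
                                      X_test_transformed, y_test,
                                      n_repeats=10, random_state=42, n_jobs=2)
# result_model.importances_mean is now indexed by the transformed columns
# (e.g. one entry per one-hot-encoded category), not by the original features.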

Evaluate Loss Function Value Getting From Training Set on Cross Validation Set

I am following Andrew Ng's instructions to evaluate the algorithm in classification:
Find the loss function value on the training set.
Compare it with the loss function value on the cross-validation set.
If both are close enough and small, go to the next step (otherwise, there is a bias or variance problem, etc.).
Make a prediction on the test set using the resulting Thetas (i.e. weights) produced from the previous step as a final confirmation.
I am trying to apply this using Scikit-Learn Library, however, I am really lost there and sure that I am totally wrong (I didn't find anything similar online):
from sklearn import model_selection, svm
from sklearn.metrics import make_scorer, log_loss
from sklearn import datasets

def main():
    iris = datasets.load_iris()
    kfold = model_selection.KFold(n_splits=10, random_state=42)
    model = svm.SVC(kernel='linear', C=1)
    results = model_selection.cross_val_score(estimator=model,
                                              X=iris.data,
                                              y=iris.target,
                                              cv=kfold,
                                              scoring=make_scorer(log_loss, greater_is_better=False))
    print(results)
Error
ValueError: y_true contains only one label (0). Please provide the true labels explicitly through the labels argument.
I am not sure even it's the right way to start. Any help is very much appreciated.
Given the clarifications you provide in the comments and that you are not particularly interested in the log loss itself, I think the most straightforward approach is to abandon log loss and go for the accuracy instead:
from sklearn import model_selection, svm
from sklearn import datasets

iris = datasets.load_iris()
kfold = model_selection.KFold(n_splits=10, random_state=42)
model = svm.SVC(kernel='linear', C=1)
results = model_selection.cross_val_score(estimator=model,
                                          X=iris.data,
                                          y=iris.target,
                                          cv=kfold,
                                          scoring="accuracy")  # change
As already mentioned in the comments, inclusion of log loss in such situations still suffers from some unresolved issues in scikit-learn (see here and here).
For the purpose of estimating the generalization ability of your model, you will be fine with the accuracy metric.
This kind of error appears often when you do cross validation.
Basically your data is split into n_splits = 10 and some classes are missing on some of these splits. For example, your 9th split may not have training examples for class number 2.
So then, when you evaluate your loss, the set of classes present in your predictions and in the test fold does not match. You cannot compute the loss if y_true contains 3 classes and your model was trained to predict only 2 of them.
What do you do in this case?
You have a few possibilities:
Shuffle your data: KFold(n_splits=10, random_state=42, shuffle=True)
Make n_splits bigger
Provide the list of labels explicitly to the loss function, as follows:
args_loss = {"labels": [0, 1, 2]}
make_scorer(log_loss, greater_is_better=False, **args_loss)
Cherry-pick your splits so you make sure this doesn't happen. I don't think KFold allows this, but GridSearchCV does.
Just for future readers who are following Andrew's Course:
K-Fold is not really applicable for this purpose, because what we mainly want is to evaluate the Thetas (i.e. weights) produced by a certain algorithm with some parameters on the cross-validation set, by comparing the two cost functions, J(train) vs J(CV), to determine if the model suffers from bias, from variance, or is O.K.
K-Fold, by contrast, is mainly for testing the prediction on the CV set using the weights produced from training the model on the training set.
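For readers who want to reproduce the J(train) vs J(CV) comparison in scikit-learn, here is a minimal sketch using a fixed train/validation split (probability=True is needed so that SVC exposes predict_proba for log_loss); treat it as an illustration rather than the canonical recipe:
from sklearn import datasets, svm
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_cv, y_train, y_cv = train_test_split(
    iris.data, iris.target, test_size=0.2, stratify=iris.target, random_state=42)

model = svm.SVC(kernel='linear', C=1, probability=True)
model.fit(X_train, y_train)

# J(train) vs J(CV): a large gap suggests variance, both being large suggests bias.
j_train = log_loss(y_train, model.predict_proba(X_train), labels=model.classes_)
j_cv = log_loss(y_cv, model.predict_proba(X_cv), labels=model.classes_)
print(j_train, j_cv)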

Do the two kinds of xgboost interface work completely the same?

I'm currently working on an in-class competition on Kaggle.
I have read the official Python API reference, and I'm kind of confused about the two kinds of interfaces, especially regarding grid search, cross-validation and early stopping.
In the native XGBoost API, I can use xgb.cv(), which splits the whole dataset into folds to cross-validate, to tune good hyperparameters and then get the best_iteration.
Thus I can set num_boost_round to the best_iteration. To make maximal use of the data, I train on the whole dataset again with the well-tuned hyperparameters, and then use that model to classify. The only drawback is that I have to write the grid-search code myself.
ATTENTION: the cross-validation set changes at each fold, so the training result will have no specific tendency toward any part of the data.
But in the sklearn interface, it seems that I cannot get best_iteration using clf.fit() as I do with the xgb model. Indeed, the fit() method has early_stopping_rounds and eval_set to implement the early stopping part. Many people implement the code like this:
X_train, X_test, y_train, y_test = train_test_split(train, target_label, test_size=0.2, random_state=0)
clf = GridSearchCV(xgb_model, para_grid, scoring='roc_auc', cv=5,
                   verbose=True, refit=True, return_train_score=False)
clf.fit(X_train, y_train, early_stopping_rounds=30, eval_set=[(X_test, y_test)])
....
clf.predict(something)
But the problem is that I split the data into two parts at the start. The evaluation set used for early stopping will not change at each fold, so the result may have a tendency toward this random part of the whole dataset. The same problem also occurs in the grid search: the final parameters may tend to fit X_test and y_test more.
I'm fond of GridSearchCV in sklearn, but I also want the eval_set to change at each fold, just like xgb.cv does. I believe this would make full use of the data while preventing overfitting.
What should I do?
I have thought of two ways:
using the XGB API, and writing the grid search myself;
using the sklearn API, and changing the eval_set manually at each fold.
Are there any more convenient methods?
As you have summarised, both approaches have advantages and disadvantages.
xgb.cv will use the left-out fold for early stopping, thus you do not need an additional split into a validation/train sample to determine when to trigger early stopping.
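For reference, a rough sketch of that native-API route; params is assumed to be a dict of already-chosen hyperparameters, and train and target_label are the full data from the question:
import xgboost as xgb

dtrain = xgb.DMatrix(train, label=target_label)
cv_results = xgb.cv(params, dtrain, num_boost_round=1000, nfold=5,
                    metrics='auc', early_stopping_rounds=30, seed=0)
# With early stopping, the returned history is truncated at the best round.
best_num_rounds = len(cv_results)
final_booster = xgb.train(params, dtrain, num_boost_round=best_num_rounds)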
GridSearchCV (or maybe you try out RandomizedSearchCV) will handle the parameter grid and the optimal choice for you.
Note that it is not a problem to use a fixed sub-sample for early stopping in all CV folds. So I do not think you have to do anything like "change the eval_set manually at each fold". The evaluation sample used for early stopping does not directly affect the model parameters; it is only used to decide when the evaluation metric on a hold-out sample stops improving. For the final model you can drop early stopping: you can see when the model stops with the optimal hyperparameters using the aforementioned split, and then use that number of trees as a fixed parameter in the final model fit.
So in the end it is a matter of taste, as in both cases you will need to compromise on something. IMO, the sklearn API is the optimal choice, as it allows you to use the rest of the sklearn tools (e.g. for data pre-processing) in a natural way in a pipeline within CV, and it offers a homogeneous interface to model training across approaches. But in the end it is up to you.
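And a hedged sketch of the sklearn-API route described above, assuming para_grid does not contain n_estimators and reusing train, target_label, X_train and X_test from the question (attribute names such as best_iteration can differ between xgboost versions):
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# 1. Grid-search the other hyperparameters with plain cross-validation.
search = GridSearchCV(XGBClassifier(n_estimators=1000), para_grid,
                      scoring='roc_auc', cv=5, refit=False)
search.fit(X_train, y_train)

# 2. Use the held-out sample only to decide when to stop boosting.
tuned = XGBClassifier(n_estimators=1000, **search.best_params_)
tuned.fit(X_train, y_train, early_stopping_rounds=30,
          eval_set=[(X_test, y_test)], verbose=False)

# 3. Refit on all the data with the number of trees fixed.
#    (older xgboost releases expose best_ntree_limit instead of best_iteration)
final = XGBClassifier(n_estimators=tuned.best_iteration + 1, **search.best_params_)
final.fit(train, target_label)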

Machine learning procedure splitting the data into 3 sets

Reading documentation and procedures while using machine learning techniques for both classification and regression, I came across a topic which is actually new to me. It seems that a recommended procedure is to split the data into three different sets before training and testing: training, validation and testing. Since this procedure makes sense to me, I was wondering how I should proceed with this, given what I came across while reading sklearn approaches and tips.
If we follow an interesting approach like the one I found here:
Stratified Train/Validation/Test-split in scikit-learn
Taking this into account, let's say we want to build a classifier using LogisticRegression (any classifier actually). The procedure, as far as I understand, should be something like this, right?
# train a logistic regression model on the training set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Now if we want to make predictions we could use:
# make class predictions for the testing set
y_pred_class = logreg.predict(X_test)
When one has to estimate the accuracy of the model, a common approach is:
# calculate accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
And here is where my question comes in. Should the validation set which was split off before be used for calculating accuracy, or rather for validating somehow, e.g. using K-fold CV instead? For instance:
# Perform 10-fold cross validation
scores = cross_val_score(logreg, df, y, cv=10)
Any hint on the procedure with these three sets would be really appreciated. What I was thinking was that the validation set should be used together with the training set, but I don't really know in which way.
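For concreteness, a minimal sketch of the three-way split referred to above; the 60/20/20 ratios and the stratification are just an illustrative choice:
from sklearn.model_selection import train_test_split

# First carve off a test set, then split the remainder into train/validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=42)
# 0.25 of the remaining 80% yields a 60/20/20 train/validation/test split.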

Is there a keras method to split data?

I think the title is self-explanatory, but to ask it in detail: there's sklearn's method train_test_split(), which works like:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, stratify=Y)
It means: the method will split the data with a 0.3 : 0.7 ratio and will try to make the percentage of labels in both parts equal. Is there a Keras equivalent of this?
Now there is, using the tf.data Dataset class. I'm running keras-2.2.4-tf along with the new TensorFlow release.
Basically, load all the data into a Dataset using something like tf.data.Dataset.from_tensor_slices. Then split the data into new datasets for training and validation. For example, shuffle all the records in the dataset. Then use all but the first 400 as training and the first 400 as validation.
ds = ds_in.shuffle(buffer_size=rec_count)
ds_train = ds.skip(400)
ds_validate = ds.take(400)
An instance of the Dataset class is a natural container to pass around for the Keras models. I copied the concept from a tensorflow or keras training example but can't seem to find it again.
The canned datasets loaded with the load_data methods are numpy.ndarray objects, so they are a little different, but they can easily be converted to a tf.data Dataset. I suspect this hasn't been done because so much existing code would break.
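For example, a hedged sketch of that conversion for the canned MNIST data, reusing the shuffle/skip/take split shown above:
import tensorflow as tf
from tensorflow.keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()
rec_count = len(X_train)

# Wrap the numpy arrays in a Dataset, shuffle, and split off 400 validation records.
ds_in = tf.data.Dataset.from_tensor_slices((X_train, y_train))
ds = ds_in.shuffle(buffer_size=rec_count, seed=42)
ds_validate = ds.take(400)
ds_train = ds.skip(400)
# Note: unlike sklearn's stratify=Y, this split is random, not stratified.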
Unfortunately, the answer (despite our wish) is No! There are some existing datasets like MNIST etc. which can be directly loaded:
(X_train, y_train), (X_test, y_test) = mnist.load_data()
This direct loading in an already-split form can give the false hope of a general method, but unfortunately one isn't present here, though you may be interested in using the scikit-learn wrappers for Keras.
There is an almost identical question on Data Science SE.
