I built decision tree and logistic regression models, and I am satisfied with the results. How do I use them on unsupervised data?
Also: will I always need to apply StandardScaler to new data?
While your question is too broad for SO, I still want to give some brief advice:
You need labeled (supervised) data only for the training stage of your model. Once you have a trained model, you can make predictions on unsupervised data (i.e. data that has no labels/targets) and the model returns the predicted labels. Usually you do this with the predict method.
Important point: to use the predict method, the data passed to the model input must have the same form as it did during training: the same set of features and the same number of features (excluding labels/targets, of course).
The same goes for preprocessing: if you used a StandardScaler on the training data, you must use it on new data too, and it must be the SAME StandardScaler (i.e. call the transform method of the scaler that was already fitted on the training data).
The philosophy of using StandardScaler or other normalisation, in short: use it for linear models (including your logistic regression). Read about it here, for example: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html
For trees it is not necessary. Example: https://towardsdatascience.com/do-decision-trees-need-feature-scaling-97809eaa60c6
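A minimal sketch of that flow with scikit-learn (X_train, y_train and X_new are placeholder names, not from the question):

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Training stage: fit the scaler and the model on the labeled data.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
model = LogisticRegression().fit(X_train_scaled, y_train)

# Prediction stage: new, unlabeled data must have the same features and
# must be scaled with the SAME fitted scaler (transform, not fit_transform).
X_new_scaled = scaler.transform(X_new)
predicted_labels = model.predict(X_new_scaled)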
I have always learned that standardization or normalization should be fit only on the training set, and then be used to transform the test set. So what I'd do is:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Now if I were to use this model on new data, I could just save scaler and load it in any new script.
What I'm having trouble understanding is how this works for k-fold CV. Is it best practice to re-fit and transform the scaler on every fold? I can see how that works while building the model, but what if I want to use the model later on: which scaler should I save?
Further, I want to extend this to time-series data. I understand how k-fold works for time series, but how do I combine scaling with CV there? In that case I would suggest saving the very last scaler, as it would be fit on 4/5 of the data (in case of k=5), i.e. on the most recent data. Would that be the correct approach?
Is it best practice to re-fit and transform the scaler on every fold?
Yes. You might want to read scikit-learn's doc on cross-validation:
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction.
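In scikit-learn this is usually done by wrapping the preprocessing and the estimator in a Pipeline and handing that to the CV routine, so the scaler is re-fitted on the training portion of every fold automatically. A minimal sketch (the estimator and data names are illustrative):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, TimeSeriesSplit

pipe = make_pipeline(StandardScaler(), LogisticRegression())

# Plain k-fold: the scaler is fitted on each fold's training part only.
scores = cross_val_score(pipe, X, y, cv=5)

# Time-series variant: ordered splits, scaler still re-fitted per fold.
ts_scores = cross_val_score(pipe, X, y, cv=TimeSeriesSplit(n_splits=5))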
Which scaler should I save?
Save the scaler (and any other preprocessing, i.e. a pipeline) and the predictor trained on all of your training data, not just (k-1)/k of it from cross-validation or 70% from a single split.
If you're doing a regression model, it's that simple.
If your model training requires hyperparameter search using cross-validation (e.g., grid search for xgboost learning parameters), then you have already gathered information from across folds, so you need another test set to estimate true out-of-sample model performance. (Once you have made this estimation, you can retrain yet again on combined train+test data. This final step is not always done for neural networks that are parameterized for a particular sample size.)
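As a sketch of that final step, assuming a scikit-learn setup (the estimator, data names and filename are placeholders):

import joblib
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# After cross-validation has validated the procedure, re-fit the whole
# pipeline (preprocessing + predictor) on ALL available training data...
final_model = make_pipeline(StandardScaler(), LogisticRegression())
final_model.fit(X, y)

# ...and persist that single fitted object; it bundles the scaler and the
# predictor, so there is no separate "which scaler should I save?" question.
joblib.dump(final_model, "model.joblib")

# Later, in another script:
model = joblib.load("model.joblib")
predictions = model.predict(X_new)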
I have a set of features for building labelling functions (set A) and another set of features for training a sklearn classifier (set B).
The generative model will output a set of probabilistic labels which I can use to train my classifier.
Do I need to add the features I used for the labelling functions (set A) to my classifier features (set B)? Or should I just use the generated labels to train my classifier?
I was referencing the snorkel spam tutorial and I did not see them use the labelling-function features to train a new classifier. As seen in cell 47, featurization is done entirely using a CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(df_train_filtered.text.tolist())
X_dev = vectorizer.transform(df_dev.text.tolist())
X_valid = vectorizer.transform(df_valid.text.tolist())
X_test = vectorizer.transform(df_test.text.tolist())
And then straight to fitting a keras model:
# Define a vanilla logistic regression model with Keras
keras_model = get_keras_logreg(input_dim=X_train.shape[1])
keras_model.fit(
    x=X_train,
    y=probs_train_filtered,
    validation_data=(X_valid, preds_to_probs(Y_valid, 2)),
    callbacks=[get_keras_early_stopping()],
    epochs=50,
    verbose=0,
)
I asked the same question on the snorkel GitHub page, and this is the response:
you do not need to add in the features (set A) that you used for LFs into the classifier features. In order to prevent the end model from simply overfitting to the labeling functions, it is better if the features for the LFs and end model (set A and set B) are as different as possible
https://github.com/snorkel-team/snorkel-tutorials/issues/193#issuecomment-576450705
From your linked snorkel tutorial: the labeling functions (which map an input to a label: "HAM", "SPAM", or "ABSTAIN") are used to provide labels, not features.
IIUC, the idea is to generate labels when you do not have good-quality human labels. Though these auto-generated labels are quite noisy, they can serve as a starting point for a labeled dataset. The learning process takes this dataset and learns a model that encodes the knowledge embedded in the labeling functions. Hopefully the model is more general than the rules themselves and can be applied to unseen data.
If some of these labeling functions (which can be considered fixed rules) are very stable in prediction accuracy under certain conditions, then given enough training data your model should be able to learn that. However, in a production system, one easy fix for possible model instability is to override machine predictions with human labels on seen data. The same idea applies if you think the labeling functions are reliable for some specific input pattern: use them to directly produce labels that override the machine predictions. This can be implemented as a pre-check that runs before your machine-learned model.
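To make this concrete, here is a hedged sketch of the end-model training, assuming a fitted Snorkel LabelModel named label_model, its label matrix L_train (built from the set A labelling functions), and a dataframe df_train; unlike the Keras route in the tutorial, sklearn's LogisticRegression needs hard labels:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Set A only feeds the labeling functions that produced L_train;
# it is never handed to the end classifier.
probs_train = label_model.predict_proba(L=L_train)  # probabilistic labels

# Set B: an independent featurization of the raw text.
vectorizer = CountVectorizer(ngram_range=(1, 2))
X_train = vectorizer.fit_transform(df_train.text.tolist())

# Collapse the probabilities to hard labels for sklearn; the Keras model
# in the tutorial consumes the probabilistic labels directly.
clf = LogisticRegression().fit(X_train, probs_train.argmax(axis=1))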
As shown in the code below, I am using StandardScaler.fit() to fit the training dataset (i.e., to compute the mean and variance of the features), and then calling .transform() to scale those features. I found in the doc and here that I should use .transform() only to transform the test dataset. In my case, I am implementing an anomaly detection model where all the training data comes from one targeted user, while the test data is collected from multiple other anomaly users. That is, we have n users; we train the model on one-class samples from the targeted user and test it on anomaly samples selected randomly from the other n-1 users.
Training dataset size: (4816, 158) => (No of samples, No of features)
Test dataset size: (2380, 158)
The issue is that the model gives bad results when I use fit() then transform() on the training dataset and only transform() on the test dataset. The model gives good results only when I use fit_transform() on both the train and test datasets instead of only transform() on the test dataset.
My question:
Should I follow the StandardScaler documentation, such that the test dataset MUST be transformed using only .transform(), without fit()? Or does it depend on the dataset, such that I may use fit_transform() on both the training and testing datasets?
Is it OK to use fit_transform on both the training and testing datasets?
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

# After preparing and splitting the training and testing dataset, we got
X_train  # from only the targeted user
X_test   # from the other "n-1" anomaly users

# Feature selection using VarianceThreshold, fitted on the training set
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
X_train = sel.fit_transform(X_train)

# Normalization using StandardScaler, fitted on the training set
scaler = StandardScaler().fit(X_train)
normalized_X_train = scaler.transform(X_train)
np.set_printoptions(precision=3)

# Apply the already-fitted VarianceThreshold to the testing set
X_test = sel.transform(X_test)

# Apply the already-fitted StandardScaler to the testing set
normalized_X_test = scaler.transform(X_test)
Should I follow the StandardScaler documentation, such that the test dataset MUST be transformed using only .transform(), without fit()? Or does it depend on the dataset, such that I may use fit_transform() on both the training and testing datasets?
The moment you re-fit your scaler on the test set, your input features follow a different scaling than the one the model was trained on. The model was fitted to inputs scaled with the training-set statistics; re-fitting the scaler overwrites those statistics, so you are faking the test input you feed to the algorithm.
So the answer is: the test set MUST only be transformed.
The way you do it above is correct. You should, in principle, never use fit on test data, only on the train data. The fact that you get "better" results using fit_transform on the test data is not indicative of any real performance gains. In other words, by using fit on the test data, you lose the ability to say something meaningful about the predictive power of your model on unseen data.
The main lesson here is that any gains in test performance are meaningless once the methodological constraints (i.e. train-test separation) are violated. You may obtain higher scores using fit_transform, but these don't mean anything anymore.
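A small illustration of that distortion, using synthetic numbers purely for demonstration:

import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[0.0], [1.0], [2.0], [3.0]])  # training mean 1.5
X_test = np.array([[10.0], [11.0]])               # shifted test distribution

scaler = StandardScaler().fit(X_train)
print(scaler.transform(X_test))
# approx [[7.6], [8.5]]: the test data really is far from the training data

print(StandardScaler().fit_transform(X_test))
# [[-1.], [1.]]: re-fitting on the test set hides that shift from the model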
When you want to transform data, you need to assign the result back, like:
data["afs"] = scaler.transform(data[["afs"]])[:, 0]
I am trying to perform cross-validation on NMF to find the best parameters to use. I tried using sklearn's cross-validation but got an error stating that NMF does not have a scoring method. Could anyone here help me with that? Thank you all.
A property of NMF is that it is an unsupervised (machine learning) method. This generally means that there is no labeled data that can serve as a 'gold standard'.
In the case of NMF, you cannot define the 'desired' outcome beforehand.
Cross-validation in sklearn is designed for supervised machine learning, in which you have labeled data by definition.
What cross-validation does is hold out a set of labeled data, train a model on the data that is left over, and evaluate this model on the held-out set. Any metric can be used for this evaluation, for example accuracy, precision, recall or F-measure, and computing these measures requires labeled data.
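A brief sketch of that supervised loop (the dataset and estimator here are illustrative); note that the score in each fold is computed against the held-out labels y, which is exactly what NMF does not have:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Each fold: train on the rest, score on the held-out part.
# The accuracy metric is computed against the held-out labels y.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=5, scoring="accuracy")
print(scores.mean())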
I came across this question while working on a sklearn ML case with heavily imbalanced data. The line below provides the basis for assessing the model from confusion-matrix and precision-recall perspectives, but ... it is a combined train/predict method:
y_pred = model_selection.cross_val_predict(model, X, Y, cv=kfold)
The question is how do I leverage this 'cross-val-trained' model to:
1) predict on another data set (scaled) instead of having to train/predict each time?
2) export/serialize/deploy the model to predict on live data?
model.predict()  # --> nope, needs a fit() first
model.fit()      # --> nope, gives a different model which does not take advantage of the cross_val_xxx methods
Any help is appreciated.
You can fit a new model with the data.
The cross validation aspect is about validating the way the model is built, not the model itself. So if the cross validation is OK, then you can train a new model with all the data.
(For more details, see my response to the question Fitting sklearn GridSearchCV model as well.)
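A sketch of the resulting workflow, reusing the variable names from the question (model, X, Y, kfold) and a placeholder filename:

import joblib
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import classification_report

# 1) Use cross-validation only to assess the modelling procedure.
y_pred = cross_val_predict(model, X, Y, cv=kfold)
print(classification_report(Y, y_pred))

# 2) If the assessment looks good, fit a fresh model on ALL the data...
model.fit(X, Y)

# 3) ...and serialize that fitted model for use on live data.
joblib.dump(model, "model.joblib")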