I am working on a project to detect out-of-domain text input using IsolationForest and tf-idf features. Here is my work in summarized form:
TRAINING
On tfidf:
Fit and transform the in-domain dataset using CountVectorizer().
Fit a TfidfTransformer() on the output of this CountVectorizer() and save the transformer (to use it at test time).
Then transform the training data using the TfidfTransformer().
Save both the CountVectorizer()'s vocabulary_ and the TfidfTransformer() object using pickle for test-time usage.
On IsolationForest:
Collect the transformed in-domain dataset and train an IsolationForest() novelty detector.
Save the model using joblib.
TESTING:
Load all of the saved models.
Get the tf-idf transformed features of the current (possibly out-of-domain) input text by replicating all the steps from training (the transformations only).
Predict whether it is out-of-domain or not, using the saved IsolationForest model.
But I have found that even though the tf-idf features are quite different for each of my test inputs, the IsolationForest always predicts 1.
What is probably going wrong?
NB: I also tried feeding dummy vectors to the IsolationForest model, mimicking the output of the tf-idf transformer, to check whether the tf-idf module is responsible for this, but no matter which random vector I provide I always get 1 as output from the IsolationForest. Also note that tf-idf produces a lot of features (tokens); in my case the count is 48015.
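For reference, a minimal sketch of the pipeline described above (train_texts, test_texts and the file names are hypothetical placeholders, not taken from the original code):
import pickle
import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.ensemble import IsolationForest
# --- training ---
count_vec = CountVectorizer()
counts = count_vec.fit_transform(train_texts)            # in-domain corpus
tfidf = TfidfTransformer().fit(counts)
X_train = tfidf.transform(counts)
iso = IsolationForest(random_state=42).fit(X_train)      # novelty/outlier detector
pickle.dump(count_vec.vocabulary_, open("vocab.pkl", "wb"))
pickle.dump(tfidf, open("tfidf.pkl", "wb"))
joblib.dump(iso, "iso_forest.joblib")
# --- testing ---
vocab = pickle.load(open("vocab.pkl", "rb"))
tfidf = pickle.load(open("tfidf.pkl", "rb"))
iso = joblib.load("iso_forest.joblib")
count_vec_test = CountVectorizer(vocabulary=vocab)       # reuse the training vocabulary
X_test = tfidf.transform(count_vec_test.transform(test_texts))
print(iso.predict(X_test))                               # 1 = inlier, -1 = outlier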
Related
I built decision tree and logistic regression models. I am satisfied with the results. How do I use them on unsupervised data?
Also: will I always need to apply StandardScaler to new data?
While your question is too broad for SO, I still want to give some short advice:
You need labeled (supervised) data only for the training stage of your model. Once you have a trained model, you can make predictions on unlabeled data (i.e. data that has no labels/targets) and the model returns predicted labels. Usually you do this by using the predict method.
Important point: to use the predict method, you must pass data to the model in the same form as it was during training - the same set of features and the same number of features (excluding labels/targets, of course).
The same goes for preprocessing - if you used a StandardScaler for the training data, you must use it for new data too - the SAME StandardScaler (i.e. call the transform method of the scaler already fitted on the training data).
The philosophy of using StandardScaler or some normalisation, in short: use it for linear models (including your logistic regression). Read about it here, for example: https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html
But for trees it is not necessary. Example: https://towardsdatascience.com/do-decision-trees-need-feature-scaling-97809eaa60c6
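A minimal sketch of this advice, assuming an already prepared labeled training set (X_train, y_train) and new unlabeled data X_new (hypothetical names):
import joblib
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# training stage (labeled data)
scaler = StandardScaler().fit(X_train)
clf = LogisticRegression(max_iter=1000).fit(scaler.transform(X_train), y_train)
joblib.dump(scaler, "scaler.joblib")
joblib.dump(clf, "model.joblib")
# later: predictions on new, unlabeled data
scaler = joblib.load("scaler.joblib")                 # the SAME fitted scaler
clf = joblib.load("model.joblib")
y_pred = clf.predict(scaler.transform(X_new))         # same features, in the same order as training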
I have always learned that standardization or normalization should be fit only on the training set, and then be used to transform the test set. So what I'd do is:
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
Now if I were to use this model on new data I could just save 'scaler' and load it to any new script.
I'm having trouble, though, understanding how this works with k-fold CV. Is it best practice to re-fit and transform the scaler on every fold? I can understand how this works while building the model, but what if I want to use the model later on? Which scaler should I save?
Further, I want to extend this to time-series data. I understand how k-fold works for time series, but again, how do I combine scaling with CV? In this case I would suggest saving the very last scaler, as this would be fit on 4/5ths of the data (in the case of k=5), including the most recent data. Would that be the correct approach?
Is it best practice to re-fit and transform the scaler on every fold?
Yes. You might want to read scikit-learn's doc on cross-validation:
Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction.
Which scaler should I save?
Save the scaler (and any other preprocessing, i.e. a pipeline) and the predictor trained on all of your training data, not just (k-1)/k of it from cross-validation or 70% from a single split.
If you're doing a regression model, it's that simple.
If your model training requires hyperparameter search using cross-validation (e.g., grid search for xgboost learning parameters), then you have already gathered information from across folds, so you need another test set to estimate true out-of-sample model performance. (Once you have made this estimation, you can retrain yet again on combined train+test data. This final step is not always done for neural networks that are parameterized for a particular sample size.)
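A sketch of that recipe using a Pipeline, so the scaler is re-fitted inside every CV fold and the object you finally save is fitted on all of your training data (the estimator and parameters here are illustrative):
import joblib
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
# inside each split the scaler is fitted on the k-1 training folds only, never on the held-out fold
scores = cross_val_score(pipe, X_train, y_train, cv=5)
# after CV, fit the whole pipeline on all training data and save that single object
pipe.fit(X_train, y_train)
joblib.dump(pipe, "pipeline.joblib")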
I am trying to transform two datasets: x_train and x_test using tsne. I assume the way to do this is to fit tsne to x_train, and then transform x_test and x_train. But, I am not able to transform any of the datasets.
tsne = TSNE(random_state = 420, n_components=2, verbose=1, perplexity=5, n_iter=350).fit(x_train)
I assume that tsne has been fitted to x_train.
But, when I do this:
x_train_tse = tsne.transform(x_subset)
I get:
AttributeError: 'TSNE' object has no attribute 'transform'
Any help will be appreciated. (I know I could do fit_transform, but wouldn't I get the same error on x_test?)
Judging by the documentation of sklearn, TSNE simply does not have any transform method.
Also, TSNE is an unsupervised method for dimensionality reduction/visualization, so it does not really work with a TRAIN and TEST split. You simply take all of your data and use fit_transform to obtain the transformation and plot it.
EDIT - It is actually not possible to learn a transformation and reuse it on different data (i.e. train and test), as t-SNE does not learn a mapping function to a lower dimensional space, but rather runs an iterative procedure on a subspace to find an equilibrium that minimizes a loss/distance ON SOME DATA.
Therefore, if you want to preprocess and reduce the dimensionality of both a train and a test dataset, the way to go is PCA/SVD or autoencoders. t-SNE will only help you with unsupervised tasks :)
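A short sketch of both options (t-SNE with fit_transform on all the data for visualisation, or PCA when you need a mapping you can reuse on a test set); x_all, x_train and x_test are placeholder names:
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
# t-SNE: no transform(), so embed everything you want to plot in a single call
embedding = TSNE(n_components=2, perplexity=5, random_state=420).fit_transform(x_all)
# PCA: learns a mapping on the train set that can be reused on the test set
pca = PCA(n_components=2).fit(x_train)
x_train_2d = pca.transform(x_train)
x_test_2d = pca.transform(x_test)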
As the accepted answer says, there is no separate transform method, and it probably wouldn't work in a train/test setting.
However, you can still use TSNE without information leakage.
Training Time
Calculate the t-SNE embedding per record on the training set and use it as a feature in your classification algorithm.
Testing Time
Append your training and testing data and fit_transform the TSNE. Now continue processing your test set, using the TSNE embedding as a feature on those records.
Does this cause information leakage? No.
Inference Time
New records arrive e.g. as images or table rows.
Add the new row(s) to the training table, calculate TSNE (i.e. where the new sample sits in the space relative to your trained samples). Perform any other processing and run your prediction against the row.
It works fine. Sometimes we worry too much about the train/test split because of Kaggle etc., but the main thing is whether your method can be replicated at inference time, with the same expected accuracy, for live use. In this case, yes it can!
The only drawback is that you need your training database available at inference time and, depending on its size, the preprocessing might be costly.
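A rough sketch of this append-and-refit idea; x_train_features, new_rows and clf are hypothetical names, with clf standing for a classifier already trained on the same feature layout (original features plus the two t-SNE columns):
import numpy as np
from sklearn.manifold import TSNE
# stack the stored training features and the newly arrived rows
combined = np.vstack([x_train_features, new_rows])
# re-run t-SNE on the combined data; the last len(new_rows) rows belong to the new samples
emb = TSNE(n_components=2, random_state=0).fit_transform(combined)
new_emb = emb[-len(new_rows):]
# append the embedding columns as extra features before calling the trained classifier
new_with_tsne = np.hstack([new_rows, new_emb])
prediction = clf.predict(new_with_tsne)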
Check out openTSNE [1]. It has all you need.
You can also save the trained model using pickle.dump for example.
[1]: https://opentsne.readthedocs.io/en/latest/index.html
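A short sketch of the workflow described in the openTSNE documentation, as far as I understand its API: fit on the training data once, then project new points into the existing embedding (x_train and x_test are placeholders):
import pickle
from openTSNE import TSNE
embedding_train = TSNE(n_components=2, random_state=42).fit(x_train)   # returns a TSNEEmbedding
embedding_test = embedding_train.transform(x_test)                     # unlike sklearn, transform() exists here
# persist the fitted embedding for later use
with open("tsne_embedding.pkl", "wb") as f:
    pickle.dump(embedding_train, f)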
As shown in the code below, I am using StandardScaler.fit() to fit the training dataset (i.e., calculate the mean and variance of the features) and then calling .transform() to scale the features. I found in the docs and here that I should use only .transform() on the test dataset. In my case, I am trying to implement an anomaly detection model where all training data comes from one targeted user, while all test data is collected from multiple other anomalous users. That is, we have n users; we train the model using one-class samples from the targeted user and test the trained model on new anomalous samples selected randomly from the other n-1 users.
Training dataset size: (4816, 158) => (No of samples, No of features)
Test dataset size: (2380, 158)
The issue is that the model gives bad results when I use fit() and then transform() for the training dataset and only transform() for the test dataset. However, the model gives good results only when I use fit_transform() on both the train and test datasets instead of only transform() on the test dataset.
My questions:
Should I follow the documentation of StandardScaler such that the test dataset MUST be transformed only using .transform(), without fit()? Or does it depend on the dataset, so that I can use fit_transform() for both the training and testing datasets?
Is it acceptable to use fit_transform() for both the training and testing datasets?
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
# After preparing and splitting the training and testing dataset, we got
X_train  # from only the targeted user
X_test   # from the other "n-1" anomaly users
# feature selection using VarianceThreshold on the training set
sel = VarianceThreshold(threshold=(.8 * (1 - .8)))
X_train = sel.fit_transform(X_train)
# normalization using StandardScaler
scaler = StandardScaler().fit(X_train)
normalized_X_train = scaler.transform(X_train)
np.set_printoptions(precision=3)
# feature selection using VarianceThreshold on the testing set (transform only)
X_test = sel.transform(X_test)
# normalization using StandardScaler (transform only, using the scaler fitted on X_train)
normalized_X_test = scaler.transform(X_test)
np.set_printoptions(precision=3)
Should I follow the documentation of StandardScaler such that the test dataset MUST be transformed only using .transform(), without fit()? Or does it depend on the dataset, so that I can use fit_transform() for both the training and testing datasets?
The moment you re-fit your scaler on the test set, your input features take on a different scaling than the one the model was trained with. The original algorithm is fitted based on the scaling of your training data; if you re-fit the scaler on the test data, that scaling is overwritten and you are faking the test-data input to the algorithm.
So the answer is: the test set MUST only be transformed.
The way you do it above is correct. You should, in principle, never use fit on test data, only on the train data. The fact that you get "better" results using fit_transform on the test data is not indicative of any real performance gains. In other words, by using fit on the test data, you lose the ability to say something meaningful about the predictive power of your model on unseen data.
The main lesson here is that any gains in test performance are meaningless once the methodological constraints (i.e. train-test separation) are violated. You may obtain higher scores using fit_transform, but these don't mean anything anymore.
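One way to make this separation hard to get wrong is to wrap the preprocessing from the question in a Pipeline, so the transformers fitted on the training data are automatically the ones applied to the test data. A sketch:
from sklearn.pipeline import make_pipeline
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler
preprocess = make_pipeline(VarianceThreshold(threshold=(.8 * (1 - .8))), StandardScaler())
normalized_X_train = preprocess.fit_transform(X_train)   # fit only on the targeted user's data
normalized_X_test = preprocess.transform(X_test)         # reuse the fitted statistics, never re-fit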
When you want to transform data, you should declare that, like:
data["afs"]=data["afs"].transform()
I'm trying to build a linear classifier using LinearSVC in Scikit learn. I decided to use the tf-idf vectorization for the purpose of vectorizing the text input. The code I wrote is:
from sklearn.feature_extraction.text import TfidfVectorizer
review_corpus = list(train_data_df['text'])
vectorizer = TfidfVectorizer(max_df=0.9, stop_words='english')
%timeit tfidf_matrix = vectorizer.fit_transform(review_corpus)
I now want to train an SVM model using this tfidf_matrix and use it to predict the class/label for the corresponding test set: test_data_df['text'].
The problem(s) I'm having:
Is it correct to use only the training data to build the TfIdfVectorizer or should I use both the training and testing text data to build the vectorizer?
The main issue is: how do I get the matrix representation for the testing data? Currently, I'm not sure how to get the tf-idf scores from the vectorizer for the different documents in the test set. What I tried was to loop through the Pandas series test_data_df['text'] and, for each text in the Series, do:
tfidf_matrix.todense(list(text))
then put the results into a list and finally make a numpy array out of it, but I get a MemoryError.
You should use only the training data to fit the TfidfVectorizer(). This ensures that you are not leaking any information from the test data into the training process.
Use
tfidf_matrix_test = vectorizer.transform(test_data_df['text'])
Now you can feed the tfidf_matrix_test to the classifier.
P.S.:
Try to avoid casting the sparse matrix output of the vectorizer to a list or dense array, because it is memory intensive and the classifier will also take more computation time during training/prediction.
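Putting the pieces together, a sketch of the full flow; it assumes train_data_df also has a target column, here hypothetically named 'label':
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
vectorizer = TfidfVectorizer(max_df=0.9, stop_words='english')
tfidf_matrix = vectorizer.fit_transform(train_data_df['text'])       # fit on training text only
clf = LinearSVC()
clf.fit(tfidf_matrix, train_data_df['label'])                        # sparse input is fine
tfidf_matrix_test = vectorizer.transform(test_data_df['text'])       # same vocabulary/idf as training
predictions = clf.predict(tfidf_matrix_test)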