Outlier Detection in Scikit-learn (Isolation Forest) in a pipeline - python

I have run into a problem: I can't use the Isolation Forest algorithm in a scikit-learn pipeline. I am trying to predict credit card fraud using the Kaggle Credit Card Fraud Detection dataset. To avoid data leakage, I want everything after data partitioning to happen inside the cross-validation (by using pipelines for every cross-validation fold, since I get an almost 100% F1-score with Logistic Regression in K-fold cross-validation when I don't use pipelines). Most machine learning algorithms work fine in the pipeline (Logistic Regression, Random Forest Classifier, etc.), but some anomaly detection algorithms such as IsolationForest do not. How can I fit these anomaly detection algorithms inside the pipeline? Thanks.
Some details for X and Y (Y: 0 = normal transaction, 1 = fraudulent transaction):
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('smote', SMOTE()),
    ('IF', IsolationForest())
])
print(cross_val_score(pipe, X, Y, scoring='f1_weighted', cv=5))
# Result: [3.01179163e-06 3.53204982e-06 6.55363495e-06 3.51940600e-06 4.52981524e-06]

Without further information, I would guess that your Pipeline import is from sklearn.pipeline. Just replace it with:
from imblearn.pipeline import Pipeline
For further information, this helped me.
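For reference, here is a minimal sketch of the corrected setup. The only change relative to the question's code is where Pipeline is imported from; the make_classification call is just a synthetic stand-in for the Kaggle data so that the snippet is self-contained.

# Same pipeline as in the question, but built with imblearn's Pipeline,
# which knows how to apply resamplers such as SMOTE only during fit.
from imblearn.pipeline import Pipeline          # instead of sklearn.pipeline
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Stand-in for the credit card data: X = features, Y = 0/1 labels.
X, Y = make_classification(n_samples=5000, n_features=20, weights=[0.98],
                           random_state=0)

pipe = Pipeline([
    ('sc', StandardScaler()),       # scaling fitted inside each CV fold
    ('smote', SMOTE()),             # resampling applied to training folds only
    ('IF', IsolationForest())       # anomaly detector as the final estimator
])

print(cross_val_score(pipe, X, Y, scoring='f1_weighted', cv=5))

One additional caveat: IsolationForest.predict returns -1/+1 rather than 0/1, so its predictions may still need to be remapped before an F1 score against 0/1 labels is meaningful.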

Related

How to compare time-series predictions between XGB and Random Forest

I have a time series forecasting assignment, and I used a random forest regressor and XGBoost to predict the future price.
I would like to ask what kind of code I should use, or what I should do as the conclusion of the assignment, to decide which model's predictions are better: XGB or random forest.
Any help, such as a link or shared code, would be much appreciated, because I have tried Googling but still can't find a solution, and my deadline is near.
The Diebold-Mariano test is one of the statistical methods for comparing forecasts: it tests whether two sets of predictions have equivalent forecast accuracy.
Diebold-Mariano Test Implementation from Kaggle
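If you need something self-contained, below is a minimal sketch of the Diebold-Mariano statistic for one-step-ahead forecasts under squared-error loss (no small-sample or autocorrelation correction); the variable names are placeholders for your own test series and the two models' predictions.

# Minimal Diebold-Mariano sketch: one-step-ahead forecasts, squared-error
# loss, no small-sample or autocorrelation correction.
import numpy as np
from scipy import stats

def diebold_mariano(actual, pred_a, pred_b):
    """DM statistic and two-sided p-value for H0: equal forecast accuracy."""
    actual, pred_a, pred_b = map(np.asarray, (actual, pred_a, pred_b))
    d = (actual - pred_a) ** 2 - (actual - pred_b) ** 2   # loss differential
    dm_stat = d.mean() / np.sqrt(d.var(ddof=1) / d.size)
    p_value = 2 * stats.norm.sf(abs(dm_stat))             # asymptotically N(0, 1)
    return dm_stat, p_value

# Usage with your hold-out data (placeholder names):
# dm, p = diebold_mariano(y_test, rf_pred, xgb_pred)
# A small p-value means the two models differ significantly; a positive
# statistic means the first prediction series has the larger loss.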

How to perform cross validation on NMF Python

I am trying to perform cross-validation on NMF to find the best parameters to use. I tried using the sklearn cross-validation utilities but get an error stating that NMF does not have a scoring method. Could anyone here help me with that? Thank you all.
A property of NMF is that it is an unsupervised (machine learning) method. This generally means that there is no labeled data that can serve as a 'gold standard'.
In the case of NMF, you cannot define the 'desired' outcome beforehand.
Cross-validation in sklearn is designed for supervised machine learning, in which you have labeled data by definition.
What cross-validation does is hold out sets of labeled data, train a model on the data that is left over, and evaluate this model on the held-out set. Any metric can be used for this evaluation, for example accuracy, precision, recall, or F-measure, and computing these measures requires labeled data.
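To make the contrast concrete, here is a small sketch on toy data: cross_val_score works out of the box for a supervised estimator because labels and a default scorer exist, while passing NMF without an explicit scoring function fails with an error like the one in the question (the exact wording depends on the sklearn version).

# Supervised vs. unsupervised: why cross_val_score complains about NMF.
from sklearn.datasets import load_digits
from sklearn.decomposition import NMF
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_digits(return_X_y=True)

# Supervised case: labels plus the classifier's default scorer (accuracy).
print(cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=3))

# Unsupervised case: NMF has no score method and no labels to score against.
try:
    cross_val_score(NMF(n_components=10, max_iter=500), X, cv=3)
except TypeError as err:
    print(err)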

Online version of Ridge Regression Classifier in scikit-learn?

I'm trying a range of online classifiers in the scikit-learn library to train a model from huge data. I found there are many classifiers supporting partial_fit, which allows for incremental learning. I want to use the Ridge Regression classifier in this setting, but could not find partial_fit in its implementation. Is there an alternative model that can do this in sklearn?
sklearn.linear_model.SGDClassifier supports partial_fit; its loss function can be 'hinge', 'log', 'modified_huber', 'squared_hinge', or 'perceptron'.
sklearn.linear_model.SGDRegressor also supports partial_fit; its default loss is 'squared_loss', and the possible values are 'squared_loss', 'huber', 'epsilon_insensitive', or 'squared_epsilon_insensitive'.
With the default penalty='l2' these optimize L2-regularized objectives, so SGDRegressor with the squared loss is essentially an online counterpart of ridge regression.
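A minimal sketch of that suggestion for the regression side, assuming data arriving in chunks: SGDRegressor with its default squared loss and penalty='l2' minimizes essentially the same L2-penalized squared-error objective as ridge regression, while supporting partial_fit for incremental learning.

# Online ridge-style regression: SGDRegressor with the default squared loss
# and an L2 penalty, trained incrementally via partial_fit.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor

X, y = make_regression(n_samples=10000, n_features=20, noise=5.0,
                       random_state=0)

reg = SGDRegressor(penalty='l2', alpha=1e-3, random_state=1)

# Feed the data chunk by chunk, as if it were streaming from disk.
for X_chunk, y_chunk in zip(np.array_split(X, 10), np.array_split(y, 10)):
    reg.partial_fit(X_chunk, y_chunk)

print(reg.score(X, y))   # R^2 on the seen data, as a rough sanity check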

sklearn calibrated classifier with random forest

Scikit-learn has very useful classifier wrappers called CalibratedClassifier and CalibratedClassifierCV, which try to make sure that the predict_proba function of a classifier really predicts a probability and not just an arbitrary (albeit perhaps well-ranked) number between zero and one.
However, when using random forests it is customary to use oob_decision_function_ to determine the performance on the training data, but this is no longer available when using the calibrated models. The calibration should therefore work well for new data but not for the training data. How can we evaluate performance on the training data to determine, e.g., overfitting?
Apparently there really was no solution to this, so I made a pull request to scikit-learn.
The issue is that the out-of-bag predictions are created during learning, so in CalibratedClassifierCV each of the sub-classifiers does have its own OOB decision function; however, each of those decision functions is calculated on a single fold of the data. It is therefore necessary to store each OOB prediction (keeping NaN values for the samples that are not in the fold), convert all of the predictions using the calibration transformation, and then average the calibrated OOB predictions to produce an updated OOB prediction.
As mentioned, I created a pull request at https://github.com/scikit-learn/scikit-learn/pull/11175. It will probably be a while before it is merged into the package, though, so if anyone really needs this feature, feel free to use my fork of scikit-learn at https://github.com/yishaishimoni/scikit-learn.
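Outside of that pull request, the idea can be approximated by hand. The sketch below is not the PR's code, just an illustration of the procedure described above, with isotonic regression standing in for the calibration map: per CV fold, fit a forest with oob_score=True, fit the calibrator on the held-out fold, apply it to that forest's OOB probabilities, and average the calibrated OOB predictions across folds (NaN marks samples outside a given fold).

# Hand-rolled approximation of calibrated out-of-bag predictions for a
# binary problem (an illustration of the idea, not the scikit-learn PR).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.isotonic import IsotonicRegression
from sklearn.model_selection import StratifiedKFold

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)

n_splits = 5
calibrated_oob = np.full((n_splits, len(y)), np.nan)   # NaN = not in this fold

cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
for i, (train_idx, cal_idx) in enumerate(cv.split(X, y)):
    forest = RandomForestClassifier(n_estimators=300, oob_score=True,
                                    random_state=0)
    forest.fit(X[train_idx], y[train_idx])

    # Calibration map fitted on the held-out fold (a sigmoid would also work).
    calibrator = IsotonicRegression(out_of_bounds='clip')
    calibrator.fit(forest.predict_proba(X[cal_idx])[:, 1], y[cal_idx])

    # Calibrate this forest's OOB probabilities for its own training samples
    # (with enough trees every training sample gets an OOB estimate).
    calibrated_oob[i, train_idx] = calibrator.predict(
        forest.oob_decision_function_[:, 1])

# Average the calibrated OOB predictions across folds, ignoring the NaNs.
oob_proba = np.nanmean(calibrated_oob, axis=0)
print(oob_proba[:10])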

python scikit learn hyperparameter tuning with out of core learning

Currently I am using
clf = SGDClassifier(loss='log', random_state=1, n_iter=1)
for training my prediction model. However, the training data is quite large, so I am using out-of-core learning:
clf.partial_fit(X_train, y_train, classes=classes)
I would also like to implement hyperparameter tuning, for instance with GridSearchCV (http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).
But since GridSearchCV does not provide a partial_fit method, it seems that out-of-core learning is not possible and I would have to keep the entire dataset in memory. Is there a way to do hyperparameter tuning while still using out-of-core learning?
I found a way to do incremental learning with random forests: there is a library called scikit-garden that provides a Mondrian forest classifier, which adds incremental (online) learning to random forests.
Check this blog post on Mondrian forests:
https://medium.com/mlrecipies/mondrian-forests-making-random-forests-better-and-efficient-b27814c681e5
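For the hyperparameter-tuning part of the question specifically, one common pattern (not from this thread, just a hedged sketch) is to loop over a small parameter grid by hand, streaming the data through partial_fit for each candidate and comparing the candidates on a held-out validation set.

# Hand-rolled grid search with out-of-core training via partial_fit.
# Synthetic data stands in for chunks read from disk; note that newer
# scikit-learn versions spell loss='log' as loss='log_loss'.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20000, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)
classes = np.unique(y_train)

scores = {}
for alpha in [1e-5, 1e-4, 1e-3]:                  # the "grid" being searched
    clf = SGDClassifier(loss='log', alpha=alpha, random_state=1)
    for X_chunk, y_chunk in zip(np.array_split(X_train, 20),
                                np.array_split(y_train, 20)):
        clf.partial_fit(X_chunk, y_chunk, classes=classes)
    scores[alpha] = clf.score(X_val, y_val)

print(scores, max(scores, key=scores.get))        # best alpha on validation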
