Is there a way to parallelize multiple model-building procedures in scikit-learn? I know that I can use the n_jobs argument in both GridSearchCV and cross_validate to achieve some sort of parallelization within one model-building procedure. However, I am running multiple model-building procedures in a for-loop with different input parameters and saving the results in a list. As an example, suppose I have 15 free CPUs and I am using n_jobs=5 in cross_validate. If I am not mistaken, that means a single model-building procedure uses 5 CPUs. Is there a way to already start the next 2 model-building procedures in my for-loop so that I am using all 15 CPUs? Here's a dummy example:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, GridSearchCV, cross_validate
# load breast cancer data set
X,y = load_breast_cancer(return_X_y=True)
# define different types of penalty strategies
# let's make a toy example and pretend we would be interested in
# running different penalty strategies (I use three times 'l2' here,
# but imagine these would be different)
penalty_types = ['l2','l2','l2']
# define output list where we add the results using different penalty strategies
nested_cv_scores_list = []
for penalty_type in penalty_types:
    # create a random number generator
    rng = np.random.RandomState(42)
    # z-standardize features
    scaler = StandardScaler()
    # use linear L2-regularized Logistic Regression as classifier
    lr = LogisticRegression(random_state=rng, penalty=penalty_type)
    # define parameter grid to optimize over (optimize C)
    lr_c = np.linspace(start=1, stop=16, num=11, endpoint=True)
    p_grid = {'lr__C': lr_c}
    # create pipeline
    lr_pipe = Pipeline([
        ('scaler', scaler),
        ('lr', lr)
    ])
    # define cross-validation strategy
    cv = KFold(shuffle=True, random_state=rng)
    # implement GridSearchCV (inner cross-validation)
    grid = GridSearchCV(lr_pipe, param_grid=p_grid, cv=cv)
    # implement cross_validate (outer cross-validation)
    nested_cv_scores = cross_validate(grid, X, y, cv=cv, n_jobs=5)
    # append result to list
    nested_cv_scores_list.append(nested_cv_scores)
Is there a way to parallelize this for-loop?
joblib.Parallel is made for this job! Just put your loop content in a function and call it using Parallel and delayed:
from joblib import Parallel, delayed
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, GridSearchCV, cross_validate
# load breast cancer data set
X,y = load_breast_cancer(return_X_y=True)
# define different types of penalty strategies
# let's make a toy example and pretend we would be interested in
# running different penalty strategies (I use three times 'l2' here,
# but imagine these would be different)
penalty_types = ['l2','l2','l2']
# define output list where we add the results using different penalty strategies
nested_cv_scores_list = []
# put rng-seed outside of loop so that not all results are the same
rng = np.random.RandomState(42)
def run_as_job(penalty_type, X, y):
    # (rng comes from the enclosing scope, see above)
    # z-standardize features
    scaler = StandardScaler()
    # use linear L2-regularized Logistic Regression as classifier
    lr = LogisticRegression(random_state=rng, penalty=penalty_type)
    # define parameter grid to optimize over (optimize C)
    lr_c = np.linspace(start=1, stop=16, num=11, endpoint=True)
    p_grid = {'lr__C': lr_c}
    ....  # additional calculation that is missing in the example
    ....  # e.g. res = cross_val_score(clf, X, y, n_jobs=2)
    return res
if __name__ == '__main__':
    results = Parallel(n_jobs=2)(delayed(run_as_job)(penalty_type, X, y)
                                 for penalty_type in penalty_types)
For more usage options, have a look at the joblib documentation: Embarrassingly parallel for loops.
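For completeness, here is one way the elided part might be filled in, reusing the nested-CV setup from the question. This is only a sketch: the n_jobs values are illustrative (3 outer jobs times 5 inner jobs roughly matches the 15 free CPUs mentioned above), and joblib may cap nested parallelism depending on the backend.
from joblib import Parallel, delayed
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, GridSearchCV, cross_validate
X, y = load_breast_cancer(return_X_y=True)
def run_as_job(penalty_type, X, y, seed=42):
    # rebuild the nested-CV procedure from the question inside the worker
    rng = np.random.RandomState(seed)
    lr = LogisticRegression(random_state=rng, penalty=penalty_type)
    lr_pipe = Pipeline([('scaler', StandardScaler()), ('lr', lr)])
    p_grid = {'lr__C': np.linspace(start=1, stop=16, num=11, endpoint=True)}
    cv = KFold(shuffle=True, random_state=rng)
    grid = GridSearchCV(lr_pipe, param_grid=p_grid, cv=cv)
    # inner parallelism still applies; total workers ~ outer jobs * inner jobs
    return cross_validate(grid, X, y, cv=cv, n_jobs=5)
if __name__ == '__main__':
    penalty_types = ['l2', 'l2', 'l2']
    nested_cv_scores_list = Parallel(n_jobs=3)(
        delayed(run_as_job)(p, X, y) for p in penalty_types)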
I'm having a tough time trying to apply a postprocessing step with the sklearn2pmml package. What I'm trying to do is apply a linear transformation after applying the predict_proba method within the PMMLPipeline class in the sklearn2pmml package. Any idea about how to do this?
Even an automatable solution outside this package would help me (like automatically modifying the XML of the PMML file).
Here's an example so you can get a deeper understanding of what I'm trying to do:
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np
from sklearn2pmml.pipeline import PMMLPipeline
from sklearn2pmml import make_pmml_pipeline, sklearn2pmml
# FORGET ABOUT TRAIN TEST SPLIT; we only care if the PMML pipeline works for now
BIRTHDAY_SEED = 1995
nrows, cols = 1000, 5
X, y = make_classification(n_samples=nrows, n_features=cols, n_informative=2, n_redundant=3, n_classes=2, shuffle=True, random_state=BIRTHDAY_SEED)
X, y = pd.DataFrame(X), pd.Series(y)
model = DecisionTreeClassifier()
model.fit(X,y)
def postprocessing_linear_transformation(probabilities, a, b):
    """This function multiplies probabilities by a and adds b"""
    return probabilities * a + b
# the pipeline should look like this
# first predict probabilities
probabilities = model.predict_proba(X)[:,0]
# then scale them (apply linear transformation)
probabilities_scaled = postprocessing_linear_transformation(probabilities, a=1000, b=100)
# of course it does not work
pmml_pipeline = PMMLPipeline([
    # here we would place the category preprocessor; I know it does not work, but you can get the idea
    ('decisiontree', model),
    ('postprocessing_apply_linear_transformation', postprocessing_linear_transformation)
])
sklearn2pmml(pmml_pipeline, "example_pipeline_pmml.pmml", with_repr = True)
On second thought, you don't need a full-blown LinearRegression step to perform a deterministic a * x + b probability scaling operation. A simple ExpressionTransformer step is more than adequate:
from sklearn2pmml.preprocessing import ExpressionTransformer
pipeline = PMMLPipeline([
    ("decisiontree", model)
], predict_proba_transformer=ExpressionTransformer("X[0] * 1000 + 100"))
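The resulting pipeline can then be exported to PMML exactly as in the question's last line:
# export the pipeline including the probability post-processing step
sklearn2pmml(pipeline, "example_pipeline_pmml.pmml", with_repr=True)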
I'm having a tough time trying to apply a postprocessing step with the sklearn2pmml package.
Don't blame the SkLearn2PMML package for your troubles. It is the Scikit-Learn framework that prohibits you from inserting two estimator objects into a single Pipeline object.
In the current case, you should rephrase your problem. What you're really trying to do is build a "chain of two models" (the first model feeding into the second model). The SkLearn2PMML package provides a sklearn2pmml.ensemble.EstimatorChain estimator type, which allows you to accomplish exactly that.
I'm currently using cross_val_score and KFold to assess the impact of using StandardScaler at different points within data pre-processing. Specifically, I want to know whether scaling the entire training dataset prior to performing cross-validation introduces data leakage, and what the effect of this is compared to scaling the data from within a Pipeline (and therefore only applying it to the training folds).
My current process is as follows:
Experiment A
Import the boston housing dataset from sklearn.datasets and split into Data (X) and target (y)
Create a Pipeline (sklearn.pipeline) that applies StandardScaler before applying linear regression
Specify the cross validation method as KFold with 5 folds
Perform cross validation (cross_val_score) using the above Pipeline and KFold method and observe the score
Experiment B
Use the same boston housing data as above
fit_transform StandardScaler on the entire dataset
Use cross_val_score to perform cross-validation, again on 5 folds, but this time input LinearRegression directly rather than a pipeline
Compare the scores here to Experiment A
The scores obtained are identical (to around 13 decimal places), which I question, as surely Experiment B introduces data leakage during cross-validation.
I've seen posts stating that it doesn't matter whether scaling is done on the entire training set before cross-validation. If this is true, I'd like to understand why; if it isn't, I'd like to understand why the scores can still be so similar despite the data leakage.
See my test code below:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.model_selection import KFold, StratifiedKFold
from sklearn.model_selection import cross_val_score, cross_val_predict
from sklearn.linear_model import LinearRegression
np.set_printoptions(15)
boston = datasets.load_boston()
X = boston["data"]
y = boston["target"]
scalar = StandardScaler()
clf = LinearRegression()
class StScaler(StandardScaler):
    def fit_transform(self, X, y=None):
        print('Length of data the scaler is fit on =', len(X))
        output = super().fit(X, y)
        # print('mean of scaler =', output.mean_)
        output = super().transform(X)
        return output
pipeline = Pipeline([('sc', StScaler()), ('estimator', clf)])
cv = KFold(n_splits=5, shuffle=True, random_state=42)  # shuffle is required when setting random_state
cross_val_score(pipeline, X, y, cv = cv)
# Now fitting Scaler on whole train data
scaler_2 = StandardScaler()
clf_2 = LinearRegression()
X_ss = scaler_2.fit_transform(X)
cross_val_score(clf_2, X_ss, y, cv=cv)
Thanks!
I have a multi-class classification problem with 9 different classes. I am using the AdaBoostClassifier class from scikit-learn to train my model without using the one vs all technique, as the number of classes is very high and it might be inefficient.
I have tried using the tips from the scikit-learn documentation [1], but there the one-vs-all technique is used, which is substantially different. In my approach I only get one prediction per event, i.e. if I have n classes, the outcome of the prediction is a single value within the n classes. For the one-vs-all approach, on the other hand, the outcome of the prediction is an array of size n with a sort of likelihood value per class.
[1]
https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html#sphx-glr-auto-examples-model-selection-plot-roc-py
The code is:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # Matplotlib plotting library for basic visualisation
%matplotlib inline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, auc
from sklearn import preprocessing
# Read data
df = pd.read_pickle('data.pkl')
# Create the dependent variable class
# This will substitute each of the n classes from
# text to number
factor = pd.factorize(df['target_var'])
df.target_var= factor[0]
definitions = factor[1]
X = df.drop('target_var', axis=1)
y = df['target_var']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 0)
bdt_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=2),
    n_estimators=250,
    learning_rate=0.3)
bdt_clf.fit(X_train, y_train)
y_pred = bdt_clf.predict(X_test)
# Reverse factorize (converting y_pred from 0s, 1s, 2s, etc. to their original values)
reversefactor = dict(zip(range(9),definitions))
y_test_rev = np.vectorize(reversefactor.get)(y_test)
y_pred_rev = np.vectorize(reversefactor.get)(y_pred)
I tried directly with the roc curve function, and also binarising the labels, but I always get the same error message.
def multiclass_roc_auc(y_test, y_pred):
    lb = preprocessing.LabelBinarizer()
    lb.fit(y_test)
    y_test = lb.transform(y_test)
    y_pred = lb.transform(y_pred)
    return roc_curve(y_test, y_pred)
multiclass_roc_auc(y_test, y_pred)
The error message is:
ValueError: multilabel-indicator format is not supported
How could this be sorted out? Am I missing some important concept?
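For reference, the linked scikit-learn example sidesteps this error by computing one ROC curve per class from probability scores rather than from hard predictions. A minimal sketch of that pattern, assuming the fitted bdt_clf from above and that the classes are the integer codes 0-8 produced by pd.factorize:
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
# binarize the true labels and score each class column against the
# corresponding predict_proba column
y_score = bdt_clf.predict_proba(X_test)  # shape (n_samples, 9)
y_test_bin = label_binarize(y_test, classes=list(range(9)))
fpr, tpr, roc_auc = {}, {}, {}
for i in range(9):
    fpr[i], tpr[i], _ = roc_curve(y_test_bin[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])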
What is the recommended way to parallelize the predict() method of a scikit-learn pipeline?
Here is a minimal working example that illustrates the problem by attempting parallel predict() on the iris data using an SVM pipeline and 5 parallel jobs:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.externals.joblib import Parallel, delayed
from sklearn import datasets
# Load Iris data
iris = datasets.load_iris()
# Create a pipeline with 2 steps: scaler and SVM; train
pipe = make_pipeline(StandardScaler(), SVC()).fit(X=iris.data, y=iris.target)
# Split data array in 5 chunks
n_chunks = 5
n_samples = iris.data.shape[0]
slices = [(int(n_samples*i/n_chunks), int(n_samples*(i+1)/n_chunks)) for i in range(n_chunks)]
data_chunks = [iris.data[i[0]:i[1]] for i in slices]
# Setup 5 parallel jobs
jobs = (delayed(pipe.predict)(array) for array in data_chunks)
parallel = Parallel(n_jobs=n_chunks)
# Run jobs: fails
results = parallel(jobs)
This code fails with:
PicklingError: Can't pickle <function Pipeline.predict at 0x000000001746B730>: it's not the same object as sklearn.pipeline.Pipeline.predict
However, applying parallelization to the SVM classifier directly, instead of the pipeline, works:
# Load Iris data
iris = datasets.load_iris()
# Create SVM classifier, train
svc = SVC().fit(X=iris.data, y=iris.target)
# Split data array in 5 chunks
n_chunks = 5
n_samples = iris.data.shape[0]
slices = [(int(n_samples*i/n_chunks), int(n_samples*(i+1)/n_chunks)) for i in range(n_chunks)]
data_chunks = [iris.data[i[0]:i[1]] for i in slices]
# Setup 5 parallel jobs
jobs = (delayed(svc.predict)(array) for array in data_chunks)
parallel = Parallel(n_jobs=n_chunks)
# Run jobs: works
results = parallel(jobs)
I am able to work around the problem by essentially taking the pipeline apart: first applying scaling to the whole array, then splitting it into chunks and parallelizing svc.predict() as above. However, this is inconvenient and largely defeats the advantages offered by pipelines: e.g. I have to keep track of intermediate results, and the code would have to change if additional transformer steps were added to the pipeline.
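For illustration, here is a sketch of that manual workaround, using the fitted pipe and the slices defined above (the step names 'standardscaler' and 'svc' are the defaults assigned by make_pipeline):
# apply the scaler once to the full array, then parallelize only the
# final estimator's predict() on the pre-scaled chunks
scaled = pipe.named_steps['standardscaler'].transform(iris.data)
scaled_chunks = [scaled[i[0]:i[1]] for i in slices]
svc_step = pipe.named_steps['svc']
results = Parallel(n_jobs=n_chunks)(
    delayed(svc_step.predict)(chunk) for chunk in scaled_chunks)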
Is there a way to parallelize using pipelines directly?
Many thanks,
Aleksey
I'm new to machine learning and trying Sklearn for the first time. I have two dataframes: one with data to train a logistic regression model (with 10-fold cross-validation), and another one to predict classes (0 or 1) using that model.
Here's my code so far using bits of tutorials I found on Sklearn docs and on the Web:
import pandas as pd
import numpy as np
import sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold
from sklearn.preprocessing import normalize
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
# Import dataframe with training data
df = pd.read_csv('summary_44.csv')
cols = df.columns.drop('num_class') # Data to use (num_class is the column with the classes)
# Import dataframe with data to predict
df_pred = pd.read_csv('new_predictions.csv')
# Scores
df_data = df.iloc[:, :-1].values
# Target
df_target = df.iloc[:, -1].values
# Values to predict
df_test = df_pred.iloc[:, :-1].values
# Scores' names
df_data_names = cols.values
# Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
# Define number of folds
kf = KFold(n_splits=10)
kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator
# Logistic regression normalizing variables
LogReg = LogisticRegression()
# 10-fold cross-validation
scores = [LogReg.fit(X[train], y[train]).score(X[test], y[test]) for train, test in kf.split(X)]
print(scores)
# Predict new
novel = LogReg.predict(X_pred)
Is this the correct way to implement a Logistic Regression?
I know that the fit() method should be used after cross-validation in order to train the model and use it for predictions. However, since I called fit() inside a list comprehension I really don't know if my model was "fitted" and can be used to make predictions.
In general things are okay, but there are some problems.
Scaling
X, X_pred, y = scale(df_data), scale(df_test), df_target
You scale the training and test data independently, which isn't correct. Both datasets must be scaled with the same scaler. Also, scale is a plain function; it is better to use a stateful transformer such as StandardScaler:
scaler = StandardScaler()
scaler.fit(df_data)
X = scaler.transform(df_data)
X_pred = scaler.transform(df_test)
Cross-validation and predicting.
How does your code work? You split the data 10 times into train and hold-out sets, fit the model on each train set, and calculate the score on the corresponding hold-out set. This way you get cross-validation scores, but each model is fitted on only part of the data. So it is better to fit the model on the whole dataset and then make predictions:
LogReg.fit(X, y)
novel = LogReg.predict(X_pred)
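As a side note (a sketch, not part of the original code): both fixes can be combined by putting the scaler and the classifier into a Pipeline, which refits the scaler inside each CV fold and reuses it for prediction automatically, using the variable names from the question:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, df_data, df_target, cv=kf)  # leak-free CV
pipe.fit(df_data, df_target)   # final fit on all training data
novel = pipe.predict(df_test)  # the scaler is applied inside the pipeline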
I should note that there are advanced techniques like stacking and boosting, but if you are learning sklearn, it is better to stick to the basics.