What is the recommended way to parallelize the predict() method of a scikit-learn pipeline?
Here is a minimal working example that illustrates the problem by attempting parallel predict() on the iris data using an SVM pipeline and 5 parallel jobs:
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.externals.joblib import Parallel, delayed
from sklearn import datasets
# Load Iris data
iris = datasets.load_iris()
# Create a pipeline with 2 steps: scaler and SVM; train
pipe = make_pipeline(StandardScaler(), SVC()).fit(X=iris.data, y=iris.target)
# Split data array in 5 chunks
n_chunks = 5
n_samples = iris.data.shape[0]
slices = [(int(n_samples*i/n_chunks), int(n_samples*(i+1)/n_chunks)) for i in range(n_chunks)]
data_chunks = [iris.data[i[0]:i[1]] for i in slices]
# Setup 5 parallel jobs
jobs = (delayed(pipe.predict)(array) for array in data_chunks)
parallel = Parallel(n_jobs=n_chunks)
# Run jobs: fails
results = parallel(jobs)
This code fails with:
PicklingError: Can't pickle <function Pipeline.predict at 0x000000001746B730>: it's not the same object as sklearn.pipeline.Pipeline.predict
However, applying parallelization to the SVM classifier directly, instead of the pipeline, works:
# Load Iris data
iris = datasets.load_iris()
# Create SVM classifier, train
svc = SVC().fit(X=iris.data, y=iris.target)
# Split data array in 5 chunks
n_chunks = 5
n_samples = iris.data.shape[0]
slices = [(int(n_samples*i/n_chunks), int(n_samples*(i+1)/n_chunks)) for i in range(n_chunks)]
data_chunks = [iris.data[i[0]:i[1]] for i in slices]
# Setup 5 parallel jobs
jobs = (delayed(svc.predict)(array) for array in data_chunks)
parallel = Parallel(n_jobs=n_chunks)
# Run jobs: works
results = parallel(jobs)
I am able to work around the problem by essentially taking the pipeline apart: first applying the scaling to the whole array, then splitting it into chunks and parallelizing svc.predict() as above. However, this is inconvenient and largely defeats the advantages offered by pipelines: e.g. I have to keep track of intermediate results, and the code would have to change if additional transformer steps were added to the pipeline.
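For reference, the workaround looks roughly like this (only the final estimator's predict() is parallelized; the step names 'standardscaler' and 'svc' are the defaults assigned by make_pipeline):
# Workaround: run the transformer once on the full array,
# then parallelize only the final estimator's predict()
scaled = pipe.named_steps['standardscaler'].transform(iris.data)
scaled_chunks = [scaled[s[0]:s[1]] for s in slices]
jobs = (delayed(pipe.named_steps['svc'].predict)(chunk) for chunk in scaled_chunks)
results = Parallel(n_jobs=n_chunks)(jobs)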
Is there a way to parallelize using pipelines directly?
Many thanks,
Aleksey
I have data that I am trying to build a tree model on. The dataset is 6 million rows. I'm using a pandas UDF to train multiple LightGBM models on multiple nodes.
I am using Databricks with the following configuration:
Worker type: r5.24xlarge
Driver type: r5.24xlarge
Max workers: 12
Since a grouped pandas UDF supposedly does the computation in parallel for each group, I thought I could finish the training faster by using a cluster with more workers.
Checking the Stage Detail page in the Web UI, I see that there is only one stage responsible for the pandas UDF training, and it took 4 hours. Looking at the details of this stage, I see it had only one task, with {'Locality Level': 'PROCESS_LOCAL'}, and that single task took the whole 4 hours. Screenshots below.
My assumption is that the training is not being parallelized. I have attached the code below, but I am not able to pinpoint where the mistake is.
#reading data into pandas dataframe
train_gbm_x = s3_read_csv()
train_gbm_x_original = train_gbm_x.copy()
#create replicas of the dataset so that I can run 4 versions of these in parallel
for i in range(1, 5, 1):
    x_t_c = train_gbm_x.copy()
    x_t_c['id'] = i
    print(train_gbm_x_original.shape)
    train_gbm_x_original = train_gbm_x_original.append([x_t_c], ignore_index=True)
train_gbm_sp = spark.createDataFrame(train_gbm_x_original)  #convert to spark dataframe
# main UDF
import pandas as pd
import pyspark.sql.functions as f
from pyspark.sql.types import StructType, StructField, DoubleType

schema = StructType([
    StructField('r_squared', DoubleType(), True)])

@f.pandas_udf(schema, f.PandasUDFType.GROUPED_MAP)
def train_RF(data):
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LinearRegression
    from scipy.stats.stats import pearsonr
    import numpy as np
    import lightgbm as lgb
    # params_xgb['objective'] = 'binary:logistic'
    import xgboost as xgb
    xgb_model = lgb.LGBMClassifier(bagging_fraction=0.9500000000000001, boosting_type='goss',
                                   feature_fraction=0.53, learning_rate=0.045075773005757123,
                                   max_depth=2, metric='binary_logloss', min_child_samples=70,
                                   min_child_weight=0, min_data_in_bin=54, min_data_in_leaf=219,
                                   min_gain_to_split=0.75, n_estimators=9900, num_leaves=1000,
                                   objective='binary', random_state=4, reg_alpha=7.906971438083483,
                                   reg_lambda=9.905206242985752, verbosity=-1)
    # create data and label groups
    y_train = data['label']
    X_train = data.drop(['employee_id', 'date', 'label'], axis=1)
    xgb_model.fit(X_train, y_train)
    # make predictions on the training data
    y_pred = xgb_model.predict(X_train)
    r = pearsonr(y_pred, y_train)
    print(y_pred)
    # return the R-squared value as a one-row DataFrame
    return pd.DataFrame({'r_squared': (r[0]**2)}, index=[0])
The call to the UDF is as follows:
from pyspark.sql import functions as F

train_gbm_sp = train_gbm_sp.sort(F.col("emp_id"), F.col("id"))
model_coeffs_r = train_gbm_sp.groupby('id').apply(train_RF)
model_coeffs_r.show(5)
Spark UI screenshot: this view shows the stage being run as a single task on a single executor ID.
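Since a grouped-map pandas UDF can run at most one task per distinct group key, a quick sanity check (only a diagnostic sketch, not part of the original code) is to count how many groups and partitions Spark actually sees:
# how many distinct group keys will the grouped-map UDF see?
print(train_gbm_sp.select('id').distinct().count())
# how many partitions does the Spark DataFrame currently have?
print(train_gbm_sp.rdd.getNumPartitions())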
I am not sure where the exact issue is; if someone could help point out what I might be doing wrong in the setup here, I would appreciate it.
Thank you
I have a list called "Data" containing 81 DataFrames (df1, df2, ..., df81), each with the same shape and labels. Let's say the independent variables (X) are 'a', 'b', 'c', and the dependent variable (Y) is 'y'. Can I perform a multivariate regression on each DataFrame inside "Data" simultaneously instead of doing it one by one, and also store each regression's accuracy (r2_score) in accuracy_list?
For example, I currently do the regression with the code below:
accuracy_list =[]
#First dataframe (df1)
X = Data['df1'][['a','b','c']]
Y = Data['df1']['y']
from sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,train_size=0.9,random_state=42)
from sklearn.linear_model import LinearRegression
rgs = LinearRegression()
rgs.fit(X_train,Y_train)
from sklearn.metrics import r2_score
y_pred = rgs.predict(X_test)
r2_score(Y_test,y_pred) # append it to accuracy_list
#second dataframe (df2)
X = Data['df2'][['a','b','c']]
Y = Data['df2']['y']
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,train_size=0.9,random_state=42)
rgs = LinearRegression()
rgs.fit(X_train,Y_train)
y_pred = rgs.predict(X_test)
r2_score(Y_test,y_pred) # append it to accuracy_list
# and so on
As I understand from your question, the important part is that you want to process the DataFrames in parallel to speed up the computation. Therefore, you could try multiprocessing, which spins up several processes to execute your code. One very convenient way, which is also used under the hood in scikit-learn, is joblib's Parallel.
In code, that would roughly read as
from joblib import Parallel, delayed
from sklearn.metrics import r2_score

def compute_r2_score(model, X, y) -> float:
    y_pred = model.predict(X)
    return r2_score(y, y_pred)

n_jobs = 2  # for 2 processes; this should be at most n_cpus - 1
# verbose=10 gives you some output on the iterations
accuracy_list = Parallel(n_jobs=n_jobs, verbose=10)(
    delayed(compute_r2_score)(rgs, df[['a','b','c']], df['y']) for df in Data.values())
Note that multiprocessing doesn't come for free and introduces additional communication and processing overhead. Apart from that, everything that runs in an individual process must be picklable, in case you run into that issue.
As a side note, multithreading wouldn't speed anything up here because of the Global Interpreter Lock, this task being almost certainly CPU-bound.
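If each DataFrame should also get its own fitted model (as in the per-df1/df2 code in the question), the split, fit, and scoring can all be moved into the parallelized function. A sketch along those lines, assuming the same column names 'a', 'b', 'c', 'y' as in the question:
from joblib import Parallel, delayed
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def fit_and_score(df) -> float:
    # split, fit and score one DataFrame independently
    X_train, X_test, Y_train, Y_test = train_test_split(
        df[['a', 'b', 'c']], df['y'], train_size=0.9, random_state=42)
    rgs = LinearRegression().fit(X_train, Y_train)
    return r2_score(Y_test, rgs.predict(X_test))

accuracy_list = Parallel(n_jobs=2, verbose=10)(
    delayed(fit_and_score)(df) for df in Data.values())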
I have limited memory and training this model is taking too much:
import sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np
clf = RandomForestClassifier(n_estimators=10)
print("Created Random Forest classifier\n")
data = pd.read_csv("House_2_ALL.csv")
print("Finished reading data\n")
data.drop("UnixTimeStamp",1)
predict = "Aggregate_Power"
print("Dropped UnixTimeStamp\n")
X = np.array(data.drop([predict],1))
Y = np.array(data[predict])
print("Created numpy Arrays\n")
X_train, X_test, Y_train, Y_test = sklearn.model_selection.train_test_split(X, Y, test_size = 0.1)
print("Assigned Testing/Training Variables\n")
clf.fit(X_train, Y_train)
print("Fit model\n")
print("Attempting to predict\n")
print(clf.predict(X_test))
When I run this program, my computer reports that it has run out of memory and that I need to quit some applications.
Any ideas on how to manage memory better, or is the only solution to reduce the size of my training dataset?
I have found that the program runs smoothly until it reaches the clf.fit(X_train, Y_train) line, so I don't know whether this is a problem with pandas' memory-hungry DataFrames or with sklearn.
In my opinion, the size of your dataset is quite large. You should therefore load your dataset in parts for training your model. I will share an example:
import os
import joblib

df = pd.read_csv(dataset_path, chunksize=10000)
# This will load only 10000 rows at a time (you can tune this for your RAM)
# Now df is an iterator over chunks and hence you can do something like this
for part_df in df:
    # Consider "part_df" as your original df and do all the preprocessing
    # and training on it. After training the model on this part_df you
    # save the model and reload it in the next iteration.
    part_df = preprocess_df(part_df)  # some preprocessing function
    xtrain, xvalid, ytrain, yvalid = train_test_split(
        part_df[feature_cols], part_df[target_col])  # some split (placeholder column names)
    if os.path.exists(model_path):  # you won't have a model for the first iteration
        model = joblib.load(model_path)  # load the model saved in the previous iteration
    else:
        model = RandomForestClassifier(n_estimators=10)  # define the model on the first iteration
    model.fit(xtrain, ytrain)  # train the model on this chunk
    joblib.dump(model, model_path)  # now you save the model for the next iteration
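One caveat with the loop above is that calling fit() again on a plain RandomForestClassifier discards the trees from earlier chunks. A variation worth knowing about (my addition, not something the original answer mentions) is scikit-learn's warm_start flag, which keeps existing trees and only fits newly added ones on the current chunk:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=0, warm_start=True)
for part_df in pd.read_csv("House_2_ALL.csv", chunksize=10000):
    X = part_df.drop(["UnixTimeStamp", "Aggregate_Power"], axis=1)
    Y = part_df["Aggregate_Power"]
    clf.n_estimators += 10   # grow 10 additional trees for this chunk
    clf.fit(X, Y)            # only the new trees are fit, on this chunk only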
There are two things worth looking at here for the memory error.
1. pandas.read_csv() with chunksize
You could use the chunksize parameter and load the data a smaller chunk at a time (it returns an object we can iterate over).
chunk_size = 50000
reader = pd.read_csv('big_file.csv', chunksize=chunk_size)
for data_chunk in reader:
    print(data_chunk.shape)  # process each chunk here
2. RandomForestClassifier/Regressor
It has the default parameters max_depth=None and min_samples_leaf=1, which means full trees are grown. If the dataset is large, the RandomForest grows fully deep trees with many nodes, leading to rapid memory consumption.
Let clf = RandomForestClassifier() and
clf.fit(X_train, y_train)
then you can check a few things like
import joblib
print(clf.estimators_[0].tree_.max_depth)  # max_depth reached on a chunk of data
joblib.dump(clf.estimators_[0], "first_tree_clf.joblib")  # get the on-disk size of one tree
Now you can try a definite value for the hyperparameter max_depth and fit the model again. Tuning the RandomForest hyperparameters this way creates shallower trees per chunk and avoids excessive memory consumption.
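As a concrete illustration of that advice, a constrained forest might look like the following; the specific values are only placeholders to tune for your data:
from sklearn.ensemble import RandomForestClassifier

# bounded trees use far less memory than fully grown ones
clf = RandomForestClassifier(n_estimators=10, max_depth=12, min_samples_leaf=5)
clf.fit(X_train, y_train)
print(clf.estimators_[0].tree_.max_depth)  # now at most 12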
Is there a way to parallelize multiple model-building procedures in scikit-learn? I know that I can use the n_jobs argument in both GridSearchCV and cross_validate to achieve some sort of parallelization within one model-building procedure. However, I am running multiple model-building procedures in a for-loop with different input parameters and saving the results in a list. Just as an example, suppose I have 15 free CPUs and I am using n_jobs=5 in cross_validate. If I am not mistaken, that means that one single model-building procedure uses 5 CPUs. Now, is there a way to already start the next 2 model-building procedures in my for-loop so that I am using all 15 CPUs? Here's a dummy example:
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, GridSearchCV, cross_validate
# load breast cancer data set
X,y = load_breast_cancer(return_X_y=True)
# define different types of penalty strategies
# let's make a toy example and pretend we would be interested in
# running different penalty strategies (I use three times 'l2' here,
# but imagine these would be different)
penalty_types = ['l2','l2','l2']
# define output list where we add the results using different penalty strategies
nested_cv_scores_list = []
for penalty_type in penalty_types:
    # create a random number generator
    rng = np.random.RandomState(42)
    # z-standardize features
    scaler = StandardScaler()
    # use linear L2-regularized Logistic Regression as classifier
    lr = LogisticRegression(random_state=rng,penalty=penalty_type)
    # define parameter grid to optimize over (optimize C)
    lr_c = np.linspace(start=1,stop=16,num=11,endpoint=True)
    p_grid = {'lr__C':lr_c}
    # create pipeline
    lr_pipe = Pipeline([
        ('scaler',scaler),
        ('lr',lr)
    ])
    # define cross validation strategy
    cv = KFold(shuffle=True,random_state=rng)
    # implement GridSearch (inner cross validation)
    grid = GridSearchCV(lr_pipe,param_grid=p_grid,cv=cv)
    # implement cross_validate (outer cross validation)
    nested_cv_scores = cross_validate(grid,X,y,cv=cv,n_jobs=5)
    # append result to list
    nested_cv_scores_list.append(nested_cv_scores)
Is there a way to parallelize this for-loop?
joblib's Parallel is made for this job! Just put your loop content in a function and call it using Parallel and delayed:
from joblib.parallel import Parallel, delayed
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, GridSearchCV, cross_validate
# load breast cancer data set
X,y = load_breast_cancer(return_X_y=True)
# define different types of penalty strategies
# let's make a toy example and pretend we would be interested in
# running different penalty strategies (I use three times 'l2' here,
# but imagine these would be different)
penalty_types = ['l2','l2','l2']
# define output list where we add the results using different penalty strategies
nested_cv_scores_list = []
# put rng-seed outside of loop so that not all results are the same
rng = np.random.RandomState(42)
def run_as_job(penalty_type, X, y):
    # z-standardize features
    scaler = StandardScaler()
    # use linear L2-regularized Logistic Regression as classifier
    lr = LogisticRegression(random_state=rng,penalty=penalty_type)
    # define parameter grid to optimize over (optimize C)
    lr_c = np.linspace(start=1,stop=16,num=11,endpoint=True)
    p_grid = {'lr__C':lr_c}
    # ... additional calculation that is missing in the example
    # ... e.g. res = cross_val_score(clf, X, y, n_jobs=2)
    return res

if __name__ == '__main__':
    results = Parallel(n_jobs=2)(delayed(run_as_job)(penalty_type, X, y) for penalty_type in penalty_types)
For more usage options, have a look at joblib: Embarrassingly parallel for loops.
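One usage note on the CPU budget from the question (my own arithmetic, assuming nothing else competes for the cores): the outer Parallel and the inner n_jobs multiply, so with cross_validate(..., n_jobs=5) inside run_as_job, an outer pool of 3 processes keeps roughly all 15 CPUs busy:
# 3 outer processes x 5 inner jobs per cross_validate ≈ 15 CPUs in use
results = Parallel(n_jobs=3)(delayed(run_as_job)(penalty_type, X, y) for penalty_type in penalty_types)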
I wrote a classifier for tweets in Python, which I then saved in .pkl format on disk so that I can run it again and again without needing to retrain it each time. This is the code:
import pandas
import re
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn import cross_validation
from sklearn.externals import joblib
#read the dataset of tweets
header_row=['sentiment','tweetid','date','query', 'user', 'text']
train = pandas.read_csv("training.data.csv",names=header_row)
#keep only the right columns
train = train[["sentiment","text"]]
#remove punctuation, special characters and numbers, and lower-case the text
def remove_spch(text):
return re.sub("[^a-z]", ' ', text.lower())
train['text'] = train['text'].apply(remove_spch)
#Feature Hashing
def tokens(doc):
"""Extract tokens from doc.
This uses a simple regex to break strings into tokens.
"""
return (tok.lower() for tok in re.findall(r"\w+", doc))
n_features = 2**18
hasher = FeatureHasher(n_features=n_features, input_type="string", non_negative=True)
X = hasher.transform(tokens(d) for d in train['text'])
y = train['sentiment']
X_new = SelectKBest(chi2, k=20000).fit_transform(X, y)
a_train, a_test, b_train, b_test = cross_validation.train_test_split(X_new, y, test_size=0.2, random_state=42)
from sklearn.ensemble import RandomForestClassifier
classifier=RandomForestClassifier(n_estimators=10)
classifier.fit(a_train.toarray(), b_train)
prediction = classifier.predict(a_test.toarray())
#Export the trained model to load it in another project
joblib.dump(classifier, 'my_model.pkl', compress=9)
Let's say that I have another Python file and I want to classify a Tweet. How can I proceed to do the classification?
from sklearn.externals import joblib
model_clone = joblib.load('my_model.pkl')
mytweet = 'Uh wow:#medium is doing a crowdsourced data-driven investigation tracking down a disappeared refugee boat'
Up to hasher.transform I can replicate the same procedure for the new tweet, but then I have the problem that I cannot calculate the best 20k features. To fit SelectKBest, you need both the features and the labels. Since I want to predict the label, I cannot fit SelectKBest again. So, how can I get past this issue and continue with the prediction?
I support the comment of @EdChum that
you build a model by training it on data which presumably is representative enough for it to cope with unseen data
Practically, this means that you need to apply both FeatureHasher and SelectKBest to your new data in transform mode only, without re-fitting them. (It is wrong to fit FeatureHasher anew on the new data, because in general it will produce different features.)
To do this, either
pickle FeatureHasher and SelectKBest separately,
or (better)
make a Pipeline of FeatureHasher, SelectKBest, and RandomForestClassifier and pickle the whole pipeline. Then you can load this pipeline and call predict on new data, as sketched below.
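A rough sketch of that pipeline approach, reusing the tokens() and remove_spch() helpers and the same hyperparameters as in the question (the file name 'my_pipeline.pkl' is just an example):
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction import FeatureHasher
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.externals import joblib

pipe = Pipeline([
    ('hasher', FeatureHasher(n_features=2**18, input_type="string", non_negative=True)),
    ('select', SelectKBest(chi2, k=20000)),
    ('clf', RandomForestClassifier(n_estimators=10)),
])
# FeatureHasher expects an iterable of tokens per sample, so tokenization stays outside the pipeline
pipe.fit([list(tokens(t)) for t in train['text']], train['sentiment'])
joblib.dump(pipe, 'my_pipeline.pkl', compress=9)

# later, in another script
pipe = joblib.load('my_pipeline.pkl')
print(pipe.predict([list(tokens(remove_spch(mytweet)))]))
Unlike the original code, this feeds the sparse matrix straight into the RandomForestClassifier instead of calling .toarray(); scikit-learn's forests accept sparse input, and keeping it sparse also saves memory.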