How to store scaling parameters for later use - python

I want to apply the scaling that the sklearn.preprocessing.scale function from scikit-learn offers, to center a dataset that I will use to train an SVM classifier.
How can I then store the standardization parameters so that I can also apply them to the data that I want to classify?
I know I can use StandardScaler, but can I somehow serialize it to a file so that I won't have to fit it to my data every time I want to run the classifier?

I think the best way is to pickle it after fitting, as this is the most generic option. Perhaps you'll later create a pipeline composed of both a feature extractor and a scaler. By pickling a (possibly compound) stage, you make things more generic. The sklearn documentation on model persistence discusses how to do this.
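For instance, a minimal sketch of the pickle-after-fit idea (the toy data and the file name are only illustrative):
import pickle
import numpy as np
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(np.array([[1.0], [2.0], [3.0], [4.0]]))  # fit on the training data

with open("scaler.pkl", "wb") as f:
    pickle.dump(scaler, f)   # persist the fitted scaler

with open("scaler.pkl", "rb") as f:
    restored = pickle.load(f)  # reload it in the classifying script

print(restored.transform(np.array([[2.5]])))  # same transform as the original scaler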
Having said that, you can query sklearn.preprocessing.StandardScaler for the fit parameters:
scale_ : ndarray, shape (n_features,)
Per feature relative scaling of the data.
New in version 0.17: scale_ is recommended instead of deprecated std_.
mean_ : array of floats with shape [n_features]
The mean value for each feature in the training set.
The following short snippet illustrates this:
from sklearn import preprocessing
import numpy as np
s = preprocessing.StandardScaler()
s.fit(np.array([[1., 2, 3, 4]]).T)
>>> s.mean_, s.scale_
(array([ 2.5]), array([ 1.11803399]))

Scale with StandardScaler:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(data)
scaled_data = scaler.transform(data)

Save mean_ and var_ for later use:
means = scaler.mean_
vars = scaler.var_
(You can print and copy-paste means and vars, or save them to disk with np.save; see the sketch below.)

Later use of the saved parameters:
def scale_data(array, means=means, stds=vars ** 0.5):
    return (array - means) / stds

scale_new_data = scale_data(new_data)
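A minimal sketch of the np.save route mentioned above, continuing the snippet (the file name is illustrative):
import numpy as np

# persist both parameter arrays in a single .npy file, stacked row-wise
np.save("scaler_params.npy", np.vstack([means, vars]))

# ...later, in the classifying script...
means_loaded, vars_loaded = np.load("scaler_params.npy")
scale_new_data = scale_data(new_data, means=means_loaded, stds=vars_loaded ** 0.5)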

You can use the joblib module to persist the fitted scaler (and with it the scaling parameters).
from joblib import dump
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(data)
dump(scaler, 'scaler_filename.joblib')
Later you can load the scaler.
from joblib import load
scaler = load('scaler_filename.joblib')
transformed_data = scaler.transform(new_data)

Pickle introduces a security vulnerability: it allows attackers to execute arbitrary code on your servers. The conditions are:
the possibility to replace the pickle file with another pickle file on the server (if no auditing of the pickle is performed, i.e. signature validation or hash comparison; a minimal hash-check sketch follows the solution list below)
the same on a developer PC (the attacker has compromised some dev machine)
If your server-side applications are executed as root (or under root in Docker containers), then this is definitely worth your attention.
Possible solutions:
Model training should be done in a secure environment
Trained models should be signed by a key from another secure environment, which is not loaded into the gpg-agent (otherwise the attacker can quite easily replace the signature)
CI should test the models in an isolated environment (quarantine)
Use Python 3.8 or later, which added audit hooks (PEP 578) that help detect code-injection techniques
or just avoid pickle :)
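As a concrete illustration of the hash-comparison idea, here is a minimal sketch; EXPECTED_SHA256 is a placeholder digest you would record out of band when the model is trained, not anything from the original answer:
import hashlib
import pickle

EXPECTED_SHA256 = "0123abcd..."  # placeholder: the digest recorded at training time

def load_verified(path):
    with open(path, "rb") as f:
        blob = f.read()
    # refuse to unpickle anything whose digest does not match the recorded one
    if hashlib.sha256(blob).hexdigest() != EXPECTED_SHA256:
        raise ValueError("model file failed the integrity check, refusing to unpickle")
    return pickle.loads(blob)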
Some links:
https://docs.python.org/3/library/pickle.html
Python: can I safely unpickle untrusted data?
https://github.com/pytorch/pytorch/issues/52596
https://www.python.org/dev/peps/pep-0578/
Possible approach to avoid pickling:
# scaler is a fitted instance of MinMaxScaler
scaler_data_ = np.array([scaler.data_min_, scaler.data_max_])
np.save("my_scaler.npy", scaler_data_, allow_pickle=False)

# some not-yet-scaled X
Xreal = np.array([1.9261148646249848, 0.7327923702472628, 118, 1083])

scaler_data_ = np.load("my_scaler.npy")
Xmin, Xmax = scaler_data_[0], scaler_data_[1]
Xscaled = (Xreal - Xmin) / (Xmax - Xmin)
Xscaled
# -> array([0.63062502, 0.35320565, 0.15144766, 0.69116555])

Related

How do you create a spark dataframe on a worker node when using HyperOpt and SparkTrials?

I'm trying to run ML trials in parallel using HyperOpt with SparkTrials on Databricks.
My objective function converts the outputs to a Spark DataFrame using spark.createDataFrame(results) (to reuse some preprocessing code I've previously created; I'd prefer not to have to rewrite this).
However, this causes an error when attempting to use HyperOpt and SparkTrials, as the SparkContext used to create the dataframe "should only be created or accessed on the driver". Is there any way I can create a Spark DataFrame in my objective function here?
For a reproducible example:
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from hyperopt import fmin, tpe, hp, SparkTrials, STATUS_OK, Trials
from pyspark.sql import SparkSession

# If you are running Databricks Runtime for Machine Learning, `mlflow` is already installed and you can skip the following line.
import mlflow

# Load the iris dataset from scikit-learn
iris = load_iris()
X = iris.data
y = iris.target

def objective(C):
    # Create a support vector classifier model
    clf = SVC(C)

    # THESE TWO LINES CAUSE THE PROBLEM
    ss = SparkSession.builder.getOrCreate()
    sdf = ss.createDataFrame([('Alice', 1)])

    # Use the cross-validation accuracy to compare the models' performance
    accuracy = cross_val_score(clf, X, y).mean()

    # Hyperopt tries to minimize the objective function. A higher accuracy value means a better model, so you must return the negative accuracy.
    return {'loss': -accuracy, 'status': STATUS_OK}

search_space = hp.lognormal('C', 0, 1.0)
algo = tpe.suggest

# THIS WORKS (it's not using SparkTrials)
argmin = fmin(
    fn=objective,
    space=search_space,
    algo=algo,
    max_evals=16)

from hyperopt import SparkTrials
spark_trials = SparkTrials()

# THIS FAILS
argmin = fmin(
    fn=objective,
    space=search_space,
    algo=algo,
    max_evals=16,
    trials=spark_trials)
I have tried looking at this question, but it is solving a different problem and I can't see an obvious way to apply it to my situation:
How can I get the current SparkSession in any place of the codes?
I think the short answer is that it's not possible. The SparkContext can only exist on the driver node; creating a new instance would be a kind of nesting, see this related question:
Nesting parallelizations in Spark? What's the right approach?
I solved my problem in the end by rewriting the transformations in pandas, which then worked (a rough sketch follows below).
If the transformations are too big for a single node then you'd probably have to pre-compute them and let hyperopt choose which version as part of the optimisation.
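For reference, a rough sketch of what that pandas rewrite might look like inside the objective; the column names and the result handling are illustrative, not the asker's actual preprocessing:
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from hyperopt import STATUS_OK

def objective(C):
    clf = SVC(C)
    accuracy = cross_val_score(clf, X, y).mean()

    # Build a local pandas DataFrame on the worker instead of calling
    # spark.createDataFrame, so the SparkContext is never touched off the driver.
    results_df = pd.DataFrame({"C": [C], "accuracy": [accuracy]})

    return {'loss': -accuracy, 'status': STATUS_OK}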

CatBoost -- suppressing iteration results in a grid search

I am trying to use the CatBoost classifier. With it I perform a grid search via the randomized_search() method. Unfortunately, the method prints to stdout the iteration results for every tree built, for every model tried.
There is a parameter that is supposed to control this: verbose. Ideally, verbose could be set to False to inhibit all stdout prints, or set to an integer specifying an interval between models that are reported (models, not trees).
Do you know how to control this? I get millions of lines in log files...
This question is somewhat related to How to suppress CatBoost iteration results?, but that one relates to the fit() method, which has logging_level and silent parameters as well. Another method, the cv() cross-validation, responds to logging_level='Silent', cutting out all output.
Setting both logging_level='Silent' when instantiating the model and verbose=False when running the random search should suppress all outputs.
import catboost
from sklearn.datasets import make_classification
from scipy import stats

# generate some data
X, y = make_classification(n_features=10)

# instantiate the model with logging_level='Silent'
model = catboost.CatBoostClassifier(iterations=1000, logging_level='Silent')
pool = catboost.Pool(X, y)

parameters = {
    'learning_rate': stats.uniform(0.01, 0.1),
    'depth': stats.binom(n=10, p=0.2)
}

# run random search with verbose=False
randomized_search_results = model.randomized_search(
    parameters,
    pool,
    n_iter=10,
    shuffle=False,
    plot=False,
    verbose=False,
)

What is Pool in CatBoost? When to use Pool instead of numpy array?

I use this code to test CatBoostClassifier.
import numpy as np
from catboost import CatBoostClassifier, Pool

# initialize data
train_data = np.random.randint(0, 100, size=(100, 10))
train_labels = np.random.randint(0, 2, size=(100))
test_data = Pool(train_data, train_labels)  # What is Pool? When to use Pool?
# test_data = np.random.randint(0, 100, size=(20, 10))  # usually we would use a numpy array, not a Pool

model = CatBoostClassifier(iterations=2,
                           depth=2,
                           learning_rate=1,
                           loss_function='Logloss',
                           verbose=True)

# train the model
model.fit(train_data, train_labels)

# make the prediction using the resulting model
preds_class = model.predict(test_data)
preds_proba = model.predict_proba(test_data)
print("class = ", preds_class)
print("proba = ", preds_proba)
The description of Pool is as follows:
Pool used in CatBoost as a data structure to train model from.
I think we will usually use a numpy array, not a Pool.
For example we use:
test_data = np.random.randint(0,100, size=(20, 10))
I did not find any further usage of Pool, so I want to know: when do we use a Pool instead of a numpy array?
CatBoost only works with Pools, which is its internal data format. If you pass a numpy array to it, it will implicitly convert it to a Pool first, without telling you.
If you need to apply many formulas to one dataset, using a Pool drastically increases performance (around 10x), because you skip the conversion step each time.
My understanding of a Pool is that it is just a convenience wrapper combining features, labels and further metadata like categorical features or a baseline.
While it does not make a big difference whether you first construct your pool and then fit your model using the pool, it does make a difference when it comes to saving your training data. If you save all the information separately, it might get out of sync, or you might forget something, and when loading you need a couple of lines to load everything. The pool comes in very handy here.
Note that when fitting you can also specify an evaluation dataset as a pool. If you want to try multiple evaluation datasets, it is quite handy to have them wrapped up in a single object; that's what pools are for.
The most important thing about CatBoost is that we do not need to encode the categorical features in our dataset. CatBoost has a built-in one-hot-encoding hyperparameter, which can be used only when the cat_features hyperparameter is specified. The cat_features hyperparameter is hard to define, as errors pop up as soon as we specify a plain array; the definition is much simpler using a Pool (see the sketch below).
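For example, a hedged sketch of wrapping features, labels and categorical-feature indices in a Pool and reusing it for training, evaluation and prediction (synthetic data; the categorical column indices are just for illustration):
import numpy as np
from catboost import CatBoostClassifier, Pool

X = np.random.randint(0, 5, size=(100, 4))
y = np.random.randint(0, 2, size=100)

# columns 0 and 1 are treated as categorical features
train_pool = Pool(X, y, cat_features=[0, 1])
eval_pool = Pool(X[:20], y[:20], cat_features=[0, 1])

model = CatBoostClassifier(iterations=10, verbose=False)
model.fit(train_pool, eval_set=eval_pool)  # the same Pool objects are reused, no re-conversion
preds = model.predict(train_pool)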

How do you feed a tf.data.Dataset dynamically in eager execution mode where initializable_iterator isn't available?

What is the new approach (under eager execution) to feeding data through a dataset pipeline in a dynamic fashion, when we need to feed it sample by sample?
I have a tf.data.Dataset which performs some preprocessing steps and reads data from a generator, drawing from a large dataset during training.
Let's say that dataset is represented as:
ds = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])
ds = ds.map(tf.square).shuffle(2).batch(2)
iterator = tf.data.make_one_shot_iterator(ds)
After training I want to produce various visualizations which require that I feed one sample at a time through the network for inference. I've now got this dataset preprocessing pipeline that I need to feed my raw sample through to be sized and shaped appropriately for the network input.
This seems like a use case for the initializable iterator:
placeholder = tf.placeholder(tf.float32, shape=None)
ds = tf.data.Dataset.from_tensor_slices(placeholder)
ds = ds.map(tf.square).shuffle(2).batch(2)
iterator = tf.data.make_initializable_iterator(ds)
# now re-initialize for each sample
Keep in mind that the map operation in this example represents a long sequence of preprocessing operations that can't be duplicated for each new data sample being fed in.
This doesn't work with eager execution; you can't use a placeholder. The documentation examples all seem to assume a static input, such as in the first example here.
The only way I can think of doing this is with a queue and tf.data.Dataset.from_generator(...) which reads from the queue that I push to before predicting on the data. But this feels both hacky, and appears prone to deadlocks that I've yet to solve.
TF 1.14.0
I just realized that the answer to this question is trivial:
Just create a new dataset!
In non-eager mode the code below would have degraded in performance, because each dataset operation would have been added to the graph and never released; that is exactly the problem the initializable iterator solves in non-eager mode.
However, in eager execution mode TensorFlow operations like this are ephemeral: the iterators aren't added to a global graph, they just get created and die when no longer referenced. Win one for TF2.0!
The code below (copy/paste runnable) demonstrates:
import tensorflow as tf
import numpy as np
import time

tf.enable_eager_execution()

inp = np.ones(shape=5000, dtype=np.float32)

t = time.time()
while True:
    ds = tf.data.Dataset.from_tensors(inp).batch(1)
    val = next(iter(ds))
    assert np.all(np.squeeze(val, axis=0) == inp)
    print('Processing time {:.2f}'.format(time.time() - t))
    t = time.time()
The motivation for the question came on the heels of this issue in 1.14 where creating multiple dataset operations in graph mode under Keras constitutes a memory leak.
https://github.com/tensorflow/tensorflow/issues/30448
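For what it's worth, a sketch of the same pattern under TF 2.x, where eager execution is the default and no enable_eager_execution() call is needed (an assumed equivalent, not part of the original answer):
import numpy as np
import tensorflow as tf  # TF 2.x

inp = np.ones(shape=5000, dtype=np.float32)

# Each dataset is an ordinary Python object: it is garbage-collected once it
# goes out of scope, so building one per sample does not grow a global graph.
ds = tf.data.Dataset.from_tensors(inp).batch(1)
val = next(iter(ds))
assert np.all(np.squeeze(val.numpy(), axis=0) == inp)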

Using dask for task scheduling to run machine learning models in parallel

So basically what I want is to run ML Pipelines in parallel.
I have been using scikit-learn, and I have decided to use DaskGridSearchCV.
What I have is a list of gridSearchCV = DaskGridSearchCV(pipeline, grid, scoring=evaluator) objects, and I run each of them sequentially:
for gridSearchCV in list:
    gridSearchCV.fit(train_data, train_target)
    predicted = gridSearchCV.predict(test_data)
If I have N different GridSearch objects, I want to take advantage as much as possible of all the available resources. If there are resources to run 2, 3, 4, ... or N at the same time in parallel, I want to do so.
So I started trying a few things based on dask's documentation. First I tried dask.threaded and dask.multiprocessing, but they end up being slower and I keep getting:
/Library/Python/2.7/site-packages/sklearn/externals/joblib/parallel.py:540: UserWarning: Multiprocessing backed parallel loops cannot be nested below threads, setting n_jobs=1
This is the code snippet:
def run_pipeline(self, gs, data):
    train_data, test_data, train_target, expected = train_test_split(data, target, test_size=0.25, random_state=33)
    model = gs.fit(train_data, train_target)
    predicted = gs.predict(test_data)

values = [delayed(run_pipeline)(gs, df) for gs in gs_list]
compute(*values, get=dask.threaded.get)
Maybe I am approaching this the wrong way, would you have any suggestions for me?
Yes, but I have a list of GridSearch objects, for example one using DecisionTree and another with RandomForest, and I want to run them in parallel as long as there are resources for it.
If this is your goal, I would merge them all into the same grid. Scikit-Learn Pipelines support grid-search across steps, which would allow you to do your search in only a single GridSearchCV object (for an example of this from the scikit-learn docs, see here). If you only have a single estimator (instead of a pipeline), you can use a Pipeline with a single step as a proxy. For example:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
import dask_searchcv as dcv

pipeline = Pipeline([('est', DecisionTreeClassifier())])

grid = [
    {'est': [DecisionTreeClassifier()],
     'est__max_features': ['sqrt', 'log2'],
     # more parameters for DecisionTreeClassifier
     },
    {'est': [RandomForestClassifier()],
     'est__max_features': ['sqrt', 'log2'],
     # more parameters for RandomForestClassifier
     },
    # more estimator/parameter subsets
]

gs = dcv.GridSearchCV(pipeline, grid)
gs.fit(train_data, train_target)
gs.predict(test_data)
Note that for this specific case (where all the estimators share the same parameters), you can merge the grid:
grid = {'est': [DecisionTreeClassifier(), RandomForestClassifier()],
        'est__max_features': ['sqrt', 'log2'],
        # more parameters for all estimators
        }
As for why your delayed example didn't work: dask.delayed is for wrapping functions that don't themselves call dask code. Since you're calling fit on a dask_searchcv.GridSearchCV object (which uses dask to compute) inside the delayed function (which also uses dask to compute), you're nesting calls to the dask scheduler, which can lead to poor performance at best and weird bugs at worst.
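To illustrate that last point, one option is to keep dask.delayed but wrap plain scikit-learn GridSearchCV objects, which don't call into dask themselves. A minimal sketch under that assumption (synthetic data, not the original pipelines):
import dask
from dask import delayed
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33)

searches = [
    GridSearchCV(DecisionTreeClassifier(), {'max_features': ['sqrt', 'log2']}, n_jobs=1),
    GridSearchCV(RandomForestClassifier(), {'max_features': ['sqrt', 'log2']}, n_jobs=1),
]

@delayed
def fit_and_predict(gs):
    gs.fit(X_train, y_train)   # plain scikit-learn, no nested dask scheduling
    return gs.predict(X_test)

results = dask.compute(*[fit_and_predict(gs) for gs in searches])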
