joblib.dump() seems to be the intended method for storing trained sklearn models for later loading and use. I like the compression option and ease of use, but later loading with joblib.load() is slow. It takes 20 seconds to load an SVM model trained on a reasonably small dataset (~10k texts). The model (stored with compress=3, as a recommended compromise) takes 100MB as a dumped file.
For my own use (analysis), I needn't worry about how fast a model loads, but I have one I'd like to share with colleagues, and would like to make it as easy and quick as possible for them. I find examples of alternatives to joblib, e.g. pickle, json or hdfs. All are basically the same idea: dump a binary object to disk.
I find langid.py, as an example that I suspect is created in a similar manner as a sklearn model. To me it looks like the whole model is encoded as some kind of string (base64?) and just pasted into the script itself. Is this a solution? How is this done?
Are there any other strategies that might be viable?
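For what it's worth, the "model pasted into the script as a string" idea can be sketched with the standard library alone. This is a minimal sketch under assumptions, not a description of how langid.py is actually built: the helper names encode_model and decode_model are made up for illustration, and a small dict stands in for the trained model.

```python
import base64
import bz2
import pickle


def encode_model(model) -> str:
    """Pickle, compress, and base64-encode an object into ASCII text."""
    raw = pickle.dumps(model, protocol=pickle.HIGHEST_PROTOCOL)
    return base64.b64encode(bz2.compress(raw)).decode("ascii")


def decode_model(text: str):
    """Reverse encode_model(). Only call this on text you trust."""
    return pickle.loads(bz2.decompress(base64.b64decode(text)))


# Stand-in for a trained model: any picklable object works the same way.
toy_model = {"classes": ["en", "de"], "weights": [[0.1, -0.2], [0.3, 0.4]]}

# This text is what would be pasted into the script as a string literal.
MODEL_STRING = encode_model(toy_model)
restored = decode_model(MODEL_STRING)
```

Note that this only changes where the bytes live. The object is still unpickled on load, and base64 makes the payload roughly a third larger than the compressed bytes, so it probably helps with distribution (one file) more than with load time.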
Related
I'm looking for a way to save my sklearn Pipeline for data pre-processing so that I can re-load it to make predictions.
So far I've only seen options like pickle or joblib, that will serialize arbitrary python objects, but the resulting file
is opaque if I wanted to store the pipeline in version control,
will serialize any python object and therefore might not be safe to unserialize, and
may run into issues with different Python or library versions.
It seems like ONNX is a great way to save models in a safe & interoperable way -- Is there any alternative for data pre-processing pipelines?
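One hand-rolled alternative, for simple steps only, is to store the fitted parameters as JSON and re-apply them with plain code. The sketch below assumes a single standard-scaling step. The function names and the "step" label are hypothetical and are not sklearn APIs, and it would not cover arbitrary pipelines.

```python
import json
from statistics import fmean, pstdev


def fit_scaler(rows):
    """Compute per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    return {
        "step": "standard_scaler",  # hypothetical label, not an sklearn name
        "mean": [fmean(col) for col in columns],
        "scale": [pstdev(col) or 1.0 for col in columns],  # avoid divide-by-zero
    }


def transform(params, rows):
    """Apply the stored parameters to new rows."""
    return [
        [(x - m) / s for x, m, s in zip(row, params["mean"], params["scale"])]
        for row in rows
    ]


train = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
params = fit_scaler(train)

# Text form: diffable in version control, and loading it executes no code.
text = json.dumps(params, indent=2, sort_keys=True)
reloaded = json.loads(text)
scaled = transform(reloaded, [[2.0, 20.0]])
```

The trade-off is that the transform logic has to be maintained by hand, which is only practical when the pipeline is a handful of well-understood steps.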
I am using fasttext models in my python library (from the official fasttext library). To run my unit tests, I need at some point a model (fasttext.FastText._FastText object), as light as possible so that I can version it in my repo.
I have tried to create a fake text dataset with 5 lines "fake.txt" and a few words and called
import fasttext
import fasttext.util
model = fasttext.train_unsupervised("./fake.txt")
fasttext.util.reduce_model(model, 2)
model.save_model("fake_model.bin")
It basically works but the model is 16 MB. That is kind of OK for a unit-test resource, but do you think I can go below this?
Note that FastText (& similar dense word-vector models) don't perform meaningfully when using toy-sized data or parameters. (All their useful/predictable/testable benefits depend on large, varied datasets & the subtle arrangements of many final vectors.)
But, if you just need a relatively meaningless object/file of the right type, your approach should work. The main parameter that would make a FastText model larger without regard to the tiny training-set is the bucket parameter, with a default value of 2000000. It will allocate that many character-ngram (word-fragment) slots, even if all your actual words don't have that many ngrams.
Setting bucket to some far-smaller value, in initial model creation, should make your plug/stand-in file far smaller as well.
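As a rough back-of-the-envelope check of why bucket dominates the file size, the arithmetic can be sketched as below. This assumes the input matrix holds one row of dim 32-bit floats per vocabulary word and per bucket. The function name, the 20-word vocabulary, and the bucket value of 1,000 are made-up illustration values.

```python
BYTES_PER_FLOAT32 = 4


def estimate_input_matrix_bytes(vocab_size: int, bucket: int, dim: int) -> int:
    """Rough size of the input matrix: one dim-sized row per word and per bucket."""
    return (vocab_size + bucket) * dim * BYTES_PER_FLOAT32


# Default bucket count, reduced to 2 dimensions, hypothetical 20-word vocabulary.
default_size = estimate_input_matrix_bytes(vocab_size=20, bucket=2_000_000, dim=2)

# Same model with a far smaller (hypothetical) bucket count.
small_size = estimate_input_matrix_bytes(vocab_size=20, bucket=1_000, dim=2)
```

Under these assumptions the default works out to about 16 MB, which is in line with the file size reported in the question, while the smaller bucket count gives a few kilobytes. In practice the value would be passed at training time, e.g. fasttext.train_unsupervised("./fake.txt", bucket=1000).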
I have a model built by roughly following the tutorial provided for the tf.estimator.BoostedTreesClassifier in the docs. I then exported it by using the tf.estimator.Estimator.export_saved_model method as described in the SavedModels from Estimators section of the SavedModel docs. This loads into TensorFlow Serving and answers gRPC and REST requests.
I'd now like to include the explanation factors along with any predictions or, less ideally, expose them as a second signature on the exported model. tf.estimator._BoostedTreesBase.experimental_predict_with_explanations already implements an appropriate algorithm, as described in the Local Interpretability section of the docs.
I thought it would be possible to 'extend' the existing estimator in a way that would let me expose this method as another served signature. I've thought of several approaches, but only tried the first two so far:
I've Tried
Change which signatures export_saved_model exports
This didn't go very far. The exposed signatures are a little dynamic, but seem to be limited to the train, predict or eval options defined by tensorflow_core.python.saved_model.model_utils.mode_keys.KerasModeKeys.
Just use an eval_savedmodel?
I briefly thought Eval might be what I was looking for, and followed some of the getting started guide for TensorFlow Model Analysis. The further I go on this path the more it seems like the main difference with an Eval model is how the data is loaded, and that isn't what I want to change.
Subclass the estimator
There are extra caveats with exporting subclassed models. And on top of that an Estimator isn't a Model. It's a model with extra metadata around inputs, outputs and configuration, so I am not clear if a subclassed estimator would even be exportable in the same way a Keras Model is.
I abandoned this subclassing approach without writing much code.
Pull the BoostedTrees Model out of the Estimator
I am not savvy enough to arrange a BoostedTrees model myself, using the low-level primitives. The code in the Estimator that sets it up looks fairly complex. It would be nice to leverage that work, but it seems that the Estimator deals in model_fns, they change depending on the train/predict/eval mode, and it isn't clear what the relationship to a Keras Model is.
I wrote a little code for this, but also gave up on it quickly.
What Next?
Given the above dead ends, which angle should I be pursuing further?
Both the low-level export API, and the low-level model building API look like they could get me closer to a solution. The gap between setting up an Estimator, and re-creating one using either API seems fairly wide.
Is it possible I could continue using the existing Estimator, but use the low-level export API to create something with an "interpret" signature that calls through to experimental_predict_with_explanations? Or even "predict and interpret" in a single step? Which tutorial will put me on that path?
My use case is I am training machine learners in python to find which model is most performant. But once I have finished choosing a model, I need to recapitulate that model in C for the shippable product.
What we have been planning to do is pick apart the model in python, save all its parameters to a file, and load these into a different model with the same structure in C. But this strikes me as unnecessarily complex if there is actually a way to save a trained model as some kind of black-box, compiled library with a predict() function.
So is there a way to compile an object with all its data to a machine-code library? All I find about saving python objects is pickling them, which is clearly not what I need.
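To make the "save the parameters, rebuild the same structure in C" plan concrete, here is a minimal sketch that writes the parameters straight into C source, so the C side needs no file loading. It assumes a plain linear model, which the question does not specify. The function names, the WEIGHTS/BIAS layout, and the numbers are all hypothetical.

```python
def emit_c_source(weights, bias, name="predict"):
    """Render a linear model as a standalone C function (hypothetical layout)."""
    values = ", ".join(repr(float(w)) for w in weights)
    return (
        f"static const double WEIGHTS[{len(weights)}] = {{{values}}};\n"
        f"static const double BIAS = {float(bias)!r};\n\n"
        f"double {name}(const double *x) {{\n"
        f"    double total = BIAS;\n"
        f"    for (int i = 0; i < {len(weights)}; ++i) {{\n"
        f"        total += WEIGHTS[i] * x[i];\n"
        f"    }}\n"
        f"    return total;\n"
        f"}}\n"
    )


def predict_py(weights, bias, x):
    """Python reference for checking the compiled C function against."""
    return bias + sum(w * xi for w, xi in zip(weights, x))


# Made-up parameters standing in for a trained model.
weights, bias = [0.5, -1.25, 2.0], 0.1
c_source = emit_c_source(weights, bias)
reference = predict_py(weights, bias, [1.0, 2.0, 3.0])
```

The generated text can be written to a .c file and compiled with the product. Keeping a Python reference like predict_py around gives something to compare the C output against on a few known inputs.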
Does anyone know which function I should use if I want to use the pre-trained doc2vec models from this website https://github.com/jhlau/doc2vec?
I know we can use KeyedVectors.load_word2vec_format() to load the word vectors from pre-trained word2vec models, but do we have a similar function to load pre-trained doc2vec models in gensim as well?
Thanks a lot.
When a model like Doc2Vec is saved with gensim's native save(), it can be reloaded with the native load() method:
model = Doc2Vec.load(filename)
Note that large internal arrays may have been saved alongside the main filename, in other filenames with extra extensions – and all those files must be kept together to re-load a fully-functional model. (You still need to specify only the main save file, and the auxiliary files will be discovered at expected names alongside it in the same directory.)
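If you need to move or share such a model, a small helper can gather the main file and its companions in one go. This is a sketch that assumes the auxiliary files are named by appending extra extensions to the main filename, as described above. The helper names and the placeholder filenames are hypothetical, and nothing here calls gensim.

```python
import glob
import os
import shutil
import tempfile


def companion_files(main_path):
    """Return the main save file plus siblings whose names extend it."""
    matches = glob.glob(glob.escape(main_path) + "*")
    return sorted(p for p in matches if os.path.isfile(p))


def copy_model(main_path, dest_dir):
    """Copy the main file and its companions into dest_dir together."""
    os.makedirs(dest_dir, exist_ok=True)
    return [shutil.copy2(p, dest_dir) for p in companion_files(main_path)]


# Demo with empty placeholder files; the names below are hypothetical.
workdir = tempfile.mkdtemp()
main = os.path.join(workdir, "doc2vec.bin")
for name in ("doc2vec.bin", "doc2vec.bin.extra.npy", "unrelated.txt"):
    open(os.path.join(workdir, name), "w").close()

found = [os.path.basename(p) for p in companion_files(main)]
copied = copy_model(main, os.path.join(workdir, "shared"))
```

Because this matches on a filename prefix, it would also pick up any unrelated file whose name happens to start with the main filename, so it is worth eyeballing the list before copying.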
You may have other issues trying to use those pre-trained models. In particular:
as noted in the linked page, the author used a custom variant of gensim that forked off about 2 years ago; the files might not load in standard gensim, or later gensims
it's not completely clear what parameters were used to train those models (though I suppose if you succeed in loading them you could see them as properties in the model), and how much meta-optimization was used for which purposes, and whether those purposes will match your own project
if the parameters are as shown in one of the repo files, train_model.py, some are inconsistent with best practices (a min_count=1 is usually bad for Doc2Vec) or apparent model-size (a mere 1.4GB model couldn't hold 300-dimensional vectors for all of the millions of documents or word-tokens in 2015 Wikipedia)
I would highly recommend training your own model, on a corpus you understand, with recent code, and using metaparameters optimized for your own purposes.
Try this:
import gensim.models as g

# Path to the downloaded pre-trained doc2vec model
model = "model_folder/doc2vec.bin"

# Load the model
m = g.Doc2Vec.load(model)