How to save pyspark model into pickle file - python

How to save pyspark model into pickle file
final_data=output_fixed.select('features','CreditabilityIndex')
train,test=final_data.randomSplit([0.7,0.3])
dtc=DecisionTreeClassifier(labelCol='CreditabilityIndex',featuresCol='features')
dtc_model=dtc.fit(train)

You can save the model using the save() method, where spark is a SparkContext object: docs
dtc_model.save(spark, "/path/to/file")
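Note that the save(spark, path) signature above belongs to the older RDD-based pyspark.mllib API. If you trained with the DataFrame-based pyspark.ml DecisionTreeClassifier, as in the question, save() takes only a path; a minimal sketch (the path is a placeholder):
from pyspark.ml.classification import DecisionTreeClassificationModel

# Save the fitted DataFrame-based model (Spark writes a directory, not a single file)
dtc_model.save("/path/to/dtc_model")

# Load it back later
same_model = DecisionTreeClassificationModel.load("/path/to/dtc_model")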

You could save your models in this fashion too -
lr = LogisticRegression(labelCol="label", featuresCol="features")
lr_model = lr.fit(train2)
lr_model.save("abc.model")
# This is how you can load it back:
sameModel = LogisticRegressionModel.load("abc.model")
PS: it will be saved relative to the location of your code file (the current working directory). You may not see a single file named abc.model, since Spark writes it out as a directory, but it will be there for you to load again, so nothing to worry about.

Related

Problem loading ML model saved using joblib/pickle

From a Jupyter notebook (.ipynb), I saved my model to pickle format using joblib.
My ML model is built using pandas, numpy and the statsmodels python library.
I saved the fitted model to a variable called fitted_model and here is how I used joblib:
from sklearn.externals import joblib
# Save RL_Model to file in the current working directory
joblib_file = "joblib_RL_Model.pkl"
joblib.dump(fitted_model, joblib_file)
I get this as output:
['joblib_RL_Model.pkl']
But when I try to load from file, in a new jupyter notebook, using:
# Load from file
joblib_file = "joblib_RL_Model.pkl"
joblib_LR_model = joblib.load(joblib_file)
joblib_LR_model
I only get this back:
<statsmodels.tsa.holtwinters.HoltWintersResultsWrapper at 0xa1a8a0ba8>
and no model. I was expecting to see the model load there and see the graph outputs as per original notebook.
Use with open(...); it is better because it automatically opens and closes the file, with the proper mode.
import pickle

with open('joblib_RL_Model.pkl', 'wb') as f:
    pickle.dump(fitted_model, f)

with open('joblib_RL_Model.pkl', 'rb') as f:
    joblib_LR_model = pickle.load(f)
My implementation in Colab is here, if you want to check it.
You can also use pickle, Python's default serialization package, to save models.
You can use the following function to save ML models:
import pickle

def save_model(model):
    pickle.dump(model, open("model.pkl", "wb"))
A general template for the function would be:
import pickle

def save_model(model):
    pickle.dump(model, open(PATH_AND_FILE_NAME_TO_BE_SAVED, "wb"))
To load a model that was saved with the pickle library, you can use the following function:
def load_model(path):
    return pickle.load(open(path, 'rb'))
where path is the path and file name the model was saved to.
Note:
This only works for basic ML models and PyTorch models; it does not work for TensorFlow-based models, where you need to use
model.save(PATH_TO_MODEL_AND_NAME)
where model is a tensorflow.keras model.
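As a quick usage sketch of the save_model/load_model helpers defined above (the scikit-learn model and file name here are just illustrative assumptions):
from sklearn.linear_model import LinearRegression

X = [[0.0], [1.0], [2.0]]
y = [0.0, 1.0, 2.0]
model = LinearRegression().fit(X, y)

save_model(model)                    # writes model.pkl to the working directory
restored = load_model("model.pkl")   # returns the same fitted estimator
print(restored.predict([[3.0]]))     # -> approximately [3.]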

Access data on AML datastore from training script

I am looking for a working example how to access data on a Azure Machine Learning managed data store from within a train.py script. I followed the instructions in the link and my script is able to resolve the datastore.
However, whatever I tried (as_download(), as_mount()), the only thing I ever got was a DataReference object. Or maybe I just don't understand how to actually read data from a file with it.
run = Run.get_context()
exp = run.experiment
ws = run.experiment.workspace
ds = Datastore.get(ws, datastore_name='mydatastore')
data_folder_mount = ds.path('mnist').as_mount()
# So far this all works. But how to go from here?
You can pass the DataReference object you created as an input to your training run (ScriptRun / Estimator / HyperDrive / Pipeline). Then, in your training script, you can access the mounted path via a script argument, as in the sketch below.
full tutorial: https://learn.microsoft.com/en-us/azure/machine-learning/service/tutorial-train-models-with-aml
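For example, a minimal sketch with the SDK v1 Estimator API (the entry script, compute target, and experiment names below are placeholder assumptions; newer SDK releases use ScriptRunConfig instead):
# submit.py -- build and submit the training run
from azureml.core import Workspace, Datastore, Experiment
from azureml.train.estimator import Estimator

ws = Workspace.from_config()
ds = Datastore.get(ws, datastore_name='mydatastore')

est = Estimator(source_directory='.',
                entry_script='train.py',
                compute_target='cpu-cluster',
                script_params={'--data-folder': ds.path('mnist').as_mount()})
run = Experiment(ws, 'mnist-train').submit(est)
Inside train.py the mounted path then arrives as an ordinary command-line argument:
# train.py -- read the mounted datastore path from the arguments
import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', type=str)
args = parser.parse_args()

print(os.listdir(args.data_folder))   # files from the 'mnist' folder on the datastore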

How to use pickle to save sklearn model

I want to dump and load my trained scikit-learn model using pickle. How do I do that?
Save:
import pickle
with open("model.pkl", "wb") as f:
pickle.dump(model, f)
Load:
with open("model.pkl", "rb") as f:
model = pickle.load(f)
In the specific case of scikit-learn, it may be better to use joblib’s
replacement of pickle (dump & load), which is more efficient on
objects that carry large numpy arrays internally as is often the case
for fitted scikit-learn estimators:
Save:
import joblib
joblib.dump(model, "model.joblib")
Load:
model = joblib.load("model.joblib")
Using pickle is the same across all machine learning models, irrespective of type, i.e. clustering, regression, etc.
To save your model, dump is used, where 'wb' means write binary.
pickle.dump(model, open(filename, 'wb')) #Saving the model
To load the saved model wherever needed, load is used, where 'rb' means read binary.
model = pickle.load(open(filename, 'rb')) #To load saved model from local directory
Here model is a KMeans model and filename is any local file path, so adjust accordingly.
One can also use joblib
from joblib import dump, load
dump(model, model_save_path)

Loading previously saved JModelica result-file

I got the following question:
I am loading a JModelica model and simulating it easily by doing:
from pymodelica import compile_fmu
from pyfmi import load_fmu
model = load_fmu(SOME_FMU)
res = model.simulate()
Everything works fine and it even saves a resulting .txt file. The problem is that I have not found any functionality so far within the JModelica Python packages to load such a .txt result file again later into a result object (like the one returned by simulate()) so I can easily extract the previously saved data.
Implementing that by hand is of course possible, but I find it quite nasty and just wanted to ask whether anyone knows of a method that loads that JModelica-format result file into a result object for me.
Thanks!!!!
The functionality that you need is located in the io module:
from pyfmi.common.io import ResultDymolaTextual
res = ResultDymolaTextual("MyResult.txt")
var = res.get_variable_data("MyVariable")
var.x #Trajectory
var.t #Corresponding time vector
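To get a quick look at the recovered trajectory, you can plot it directly (assuming matplotlib is installed; variable names follow the snippet above):
import matplotlib.pyplot as plt

plt.plot(var.t, var.x)
plt.xlabel("time")
plt.ylabel("MyVariable")
plt.show()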

How to avoid loading a large file into a python script repeatedly?

I've written a python script to take a large file (a matrix ~50k rows X ~500 cols) and use it as a dataset to train a random forest model.
My script has two functions, one to load the dataset and the other to train the random forest model using said data. These both work fine, but the file upload takes ~45 seconds and it's a pain to do this every time I want to train a subtly different model (testing many models on the same dataset). Here is the file upload code:
import io
import numpy as np

def load_train_data(train_file):
    # Read in training file
    train_f = io.open(train_file)
    train_id_list = []
    train_val_list = []
    for line in train_f:
        list_line = line.strip().split("\t")
        if list_line[0] != "Domain":
            train_identifier = list_line[9]
            train_values = list_line[12:]
            train_id_list.append(train_identifier)
            train_val_float = [float(x) for x in train_values]
            train_val_list.append(train_val_float)
    train_f.close()
    train_val_array = np.asarray(train_val_list)
    return (train_id_list, train_val_array)
This returns a list of identifiers (column 9, the label) and a numpy array of the values in columns 12 to the end (the metadata used to train the random forest).
I am going to train many different variants of my model on the same data, so I just want to load the file once and have it available to feed into my random forest function. I think I want the data to be a persistent object (I am fairly new to Python).
If I understand you correctly, the data set does not change but the model parameters do change and you are changing the parameters after each run.
I would put the file load script in one file, and run this in the python interpreter. Then the file will load and be saved in memory with whatever variable you use.
Then you can import another file with your model code, and run that with the training data as argument.
If all your model changes can be determined as parameters in a function call, all you need is to import your model and then call the training function with different parameter settings.
If you need to change the model code between runs, save with a new filename and import that one, run again and send the source data to that one.
If you don't want to save each model modification with a new filename, you might be able to use the reload functionality depending on python version, but it is not recommended (see Proper way to reload a python module from the console)
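As a concrete sketch of that workflow (the module and function names here are hypothetical placeholders for your own code):
# In an interactive Python/IPython session
from load_data import load_train_data   # module holding the parsing function
import model_code                        # module holding the training code

train_ids, train_vals = load_train_data("train.txt")   # slow step, run once

# Re-train with different parameters without re-reading the file
forest_a = model_code.train_forest(train_vals, n_trees=100)
forest_b = model_code.train_forest(train_vals, n_trees=500)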
Simplest way would be to cache the results, like so:
_train_data_cache = {}

def load_cached_train_data(train_file):
    if train_file not in _train_data_cache:
        _train_data_cache[train_file] = load_train_data(train_file)
    return _train_data_cache[train_file]
Try to learn about Python data serialization. You would basically store the large file as a Python-specific, serialized binary object using Python's marshal module, which drastically speeds up file I/O. See these benchmarks for the performance differences. However, if these random forest models are all trained in the same session, you could just train them against the data set you already have in memory and release the training data after completion.
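For example, a sketch of an on-disk cache using pickle instead of marshal (pickle handles the tuple of list and numpy array directly; the cache file name is arbitrary):
import os
import pickle

CACHE_PATH = "train_data.pkl"   # arbitrary cache file name

def load_train_data_cached(train_file, cache_path=CACHE_PATH):
    # Parse the file once, then reuse the pickled copy on later runs
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = load_train_data(train_file)   # the parsing function from the question
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data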
Load your data in ipython.
my_data = open("data.txt")
Write your codes in a python script, say example.py, which uses this data. At the top of the script example.py add these lines:
import sys
args = sys.argv
data = args[1]
...
Now run the python script in ipython:
%run example.py $my_data
Now, when running your python script, you don't need to load data multiple times.
