From a Jupyter notebook (.ipynb), I saved my fitted model to a pickle (.pkl) file using joblib.
My ML model is built using pandas, numpy and the statsmodels Python library.
I saved the fitted model to a variable called fitted_model, and here is how I used joblib:
from sklearn.externals import joblib
# Save RL_Model to file in the current working directory
joblib_file = "joblib_RL_Model.pkl"
joblib.dump(fitted_model, joblib_file)
I get this as output:
['joblib_RL_Model.pkl']
But when I try to load from file, in a new jupyter notebook, using:
# Load from file
joblib_file = "joblib_RL_Model.pkl"
joblib_LR_model = joblib.load(joblib_file)
joblib_LR_model
I only get this back:
<statsmodels.tsa.holtwinters.HoltWintersResultsWrapper at 0xa1a8a0ba8>
and no model. I was expecting to see the model load there and to see the graph outputs as in the original notebook.
Use with open; it is better because it automatically opens and closes the file, and with the proper mode. Also note that the HoltWintersResultsWrapper repr you got back means the object did load; you still have to call its methods to reproduce the plots from the original notebook.
import pickle

with open('joblib_RL_Model.pkl', 'wb') as f:
    pickle.dump(fitted_model, f)

with open('joblib_RL_Model.pkl', 'rb') as f:
    joblib_LR_model = pickle.load(f)
My implementation in Colab is here; check it out.
You can also use pickle, Python's default serialization package, to save models.
You can use the following function to save an ML model:
import pickle

def save_model(model):
    # Serialize the fitted model to model.pkl in the current directory
    with open("model.pkl", "wb") as f:
        pickle.dump(model, f)
A template for the function would be:
import pickle

def save_model(model):
    # PATH_AND_FILE_NAME_TO_BE_SAVED is the destination path for the pickle file
    with open(PATH_AND_FILE_NAME_TO_BE_SAVED, "wb") as f:
        pickle.dump(model, f)
To load a model that was saved with the pickle library, you can use the following function:
import pickle

def load_model(path):
    # Deserialize and return the model object stored at path
    with open(path, "rb") as f:
        return pickle.load(f)
Here path is the path and file name where the model was saved.
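For example, a hypothetical round trip with these helpers, where my_model stands for whatever fitted estimator you want to persist:

save_model(my_model)                 # writes model.pkl to the current directory
restored = load_model("model.pkl")   # returns the unpickled model object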
Note:
This only works for basic ML models and PyTorch models; it does not work for TensorFlow-based models, where you need to use
model.save(PATH_TO_MODEL_AND_NAME)
where model is an instance of tf.keras.Model.
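A minimal sketch of that Keras-specific path (the tiny model and the file name my_model.keras are placeholders for illustration, assuming a recent TensorFlow/Keras version):

import tensorflow as tf

# Build and save a Keras model; any tf.keras.Model is saved the same way
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
model.save("my_model.keras")          # Keras-native serialization

# Load it back later
restored = tf.keras.models.load_model("my_model.keras")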
Related
How to save a pyspark model into a pickle file
final_data = output_fixed.select('features', 'CreditabilityIndex')
train, test = final_data.randomSplit([0.7, 0.3])
dtc = DecisionTreeClassifier(labelCol='CreditabilityIndex', featuresCol='features')
dtc_model = dtc.fit(train)
You can save the model using the save() method: docs. With the DataFrame-based pyspark.ml API used in the question, save() takes just the path; it is the older RDD-based pyspark.mllib API whose save() takes a SparkContext as the first argument:
dtc_model.save("/path/to/file")
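To load it back in a later session, you would use the matching model class from the DataFrame-based API (a sketch, assuming the pyspark.ml classes from the question):

from pyspark.ml.classification import DecisionTreeClassificationModel

# Reload the previously saved decision-tree model from the same path
dtc_model = DecisionTreeClassificationModel.load("/path/to/file")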
You could save your models in this fashion too:
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel

lr = LogisticRegression(labelCol="label", featuresCol="features")
lr_model = lr.fit(train2)
lr_model.save("abc.model")
This is how you can load it back:
sameModel = LogisticRegressionModel.load("abc.model")
PS - With a relative path like this, it will be saved relative to your working directory. Note that Spark writes the model out as a directory of metadata and parquet files rather than a single file, so you may not see one file called abc.model, but it is there for you to load again. So nothing to worry about.
TensorFlow Datasets (tfds) automatically starts downloading the data I want. I already have CIFAR-10 downloaded on my system. In PyTorch I can load the local copy directly using:
torchvision.datasets.CIFAR10('path/to/directory', ..., download=False)
Is there a TensorFlow or Keras equivalent of this?
I think that the best you can do is first extract the tar file:
import tarfile

# fname is the path to the downloaded archive, e.g. "cifar-10-python.tar.gz"
if fname.endswith("tar.gz"):
    tar = tarfile.open(fname, "r:gz")
    tar.extractall()
    tar.close()
elif fname.endswith("tar"):
    tar = tarfile.open(fname, "r:")
    tar.extractall()
    tar.close()
and then access the extracted data and load it using Keras:
https://www.tensorflow.org/api_docs/python/tf/keras/models/load_model
Posting another way to load the file from local for whoever finds it later.
In the URL parameter, you supply the local URL of the file, including the path. For example, if I have a file at Workspace/DataFiles/tldr.gz on my D drive, the value I give for the URL parameter would be something like this:
path = 'file:///D:/Workspace/DataFiles/tldr.gz'
path_to_downloaded_file = tf.keras.utils.get_file("tldr_data", path, archive_format='tar', untar=True)
This way keras recognizes the URL and loads the data from the file.
I want to dump and load my trained scikit-learn model using pickle. How do I do that?
Save:
import pickle
with open("model.pkl", "wb") as f:
pickle.dump(model, f)
Load:
with open("model.pkl", "rb") as f:
model = pickle.load(f)
In the specific case of scikit-learn, it may be better to use joblib’s
replacement of pickle (dump & load), which is more efficient on
objects that carry large numpy arrays internally as is often the case
for fitted scikit-learn estimators:
Save:
import joblib
joblib.dump(model, "model.joblib")
Load:
model = joblib.load("model.joblib")
Using pickle is the same across all machine-learning models, irrespective of type (clustering, regression, etc.).
To save your model, dump is used, where 'wb' means write binary:
import pickle

pickle.dump(model, open(filename, 'wb'))  # saving the model
To load the saved model wherever it is needed, load is used, where 'rb' means read binary:
model = pickle.load(open(filename, 'rb'))  # load the saved model from the local directory
Here model is a KMeans model and filename is any local file path, so adjust accordingly.
One can also use joblib
from joblib import dump, load
dump(model, model_save_path)
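and, to load it back later (same joblib import, same path):

model = load(model_save_path)   # returns the deserialized model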
I'm currently playing around with some neural networks in TensorFlow - I decided to try working with the CIFAR-10 dataset. I downloaded the "CIFAR-10 python" dataset from the website: https://www.cs.toronto.edu/~kriz/cifar.html.
In Python, I also tried directly copying the code that is provided to load the data:
def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict
However, when I run this, I end up with the following error: _pickle.UnpicklingError: invalid load key, '\x1f'. I've also tried opening the file using the gzip module (with gzip.open(file, 'rb') as fo:), but this didn't work either.
Is the dataset simply bad, or is this an issue with the code? If the dataset is bad, where can I obtain the proper CIFAR-10 dataset?
Extract your *.tar.gz file and use this code:
from six.moves import cPickle

# data_batch_1 is one of the extracted batch files
with open("path/data_batch_1", 'rb') as f:
    datadict = cPickle.load(f, encoding='latin1')

X = datadict["data"]
Y = datadict['labels']
Just extract your tar.gz file; you will get a folder containing data_batch_1, data_batch_2, ...
After that, just use the code provided to load the data into your project:
def unpickle(file):
    import pickle
    with open(file, 'rb') as fo:
        dict = pickle.load(fo, encoding='bytes')
    return dict

dict = unpickle('data_batch_1')
It seems that you need to decompress the *.gz file and then unpack the *.tar file to get a folder of data batches; the invalid load key '\x1f' is the first byte of the gzip magic number, which means pickle is being pointed at the still-compressed archive. Afterwards you can apply pickle.load() on these batches.
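For what it's worth, tarfile can handle the gzip layer and the tar unpacking in one step; a minimal sketch, assuming the downloaded archive is cifar-10-python.tar.gz in the current directory:

import tarfile

# "r:gz" decompresses and unpacks in one pass, leaving the
# cifar-10-batches-py/ folder with data_batch_1 ... data_batch_5 and test_batch
with tarfile.open("cifar-10-python.tar.gz", "r:gz") as tar:
    tar.extractall()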
I was facing the same problem using Jupyter (VS Code) and Python 3.8/3.7. I tried to edit the source files cifar.py and cifar10.py, but without success.
The solution for me was to run these two lines of code in a separate, normal .py file:
from tensorflow.keras.datasets import cifar10
cifar10.load_data()
After that it worked fine in Jupyter.
Try this:
import pickle
import _pickle as cPickle
import gzip
with gzip.open(path_of_your_cpickle_file, 'rb') as f:
    var = cPickle.load(f)
Try it this way:
import pickle
import gzip

with gzip.open(path, "rb") as f:
    loaded = pickle.load(f, encoding='bytes')
It works for me.
I've written a Python script to take a large file (a matrix of ~50k rows x ~500 cols) and use it as a dataset to train a random forest model.
My script has two functions, one to load the dataset and the other to train the random forest model using that data. These both work fine, but loading the file takes ~45 seconds and it's a pain to do this every time I want to train a subtly different model (testing many models on the same dataset). Here is the file-loading code:
import io
import numpy as np

def load_train_data(train_file):
    # Read in training file
    train_f = io.open(train_file)
    train_id_list = []
    train_val_list = []
    for line in train_f:
        list_line = line.strip().split("\t")
        if list_line[0] != "Domain":
            train_identifier = list_line[9]
            train_values = list_line[12:]
            train_id_list.append(train_identifier)
            train_val_float = [float(x) for x in train_values]
            train_val_list.append(train_val_float)
    train_f.close()
    train_val_array = np.asarray(train_val_list)
    return (train_id_list, train_val_array)
This returns the labels (column 9) as a list and the values from column 12 onward as a numpy array, which are used to train the random forest.
I am going to train many different forms of my model with the same data, so I just want to load the file one time and have it available to feed into my random forest function. I want the data to be held in an object, I think (I am fairly new to Python).
If I understand you correctly, the data set does not change but the model parameters do change and you are changing the parameters after each run.
I would put the file load script in one file, and run this in the python interpreter. Then the file will load and be saved in memory with whatever variable you use.
Then you can import another file with your model code, and run that with the training data as argument.
If all your model changes can be determined as parameters in a function call, all you need is to import your model and then call the training function with different parameter settings.
If you need to change the model code between runs, save with a new filename and import that one, run again and send the source data to that one.
If you don't want to save each model modification with a new filename, you might be able to use the reload functionality depending on python version, but it is not recommended (see Proper way to reload a python module from the console)
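A minimal sketch of that workflow (the module and function names here are hypothetical, and load_train_data is the function from the question):

# model_code.py -- holds only the training logic
from sklearn.ensemble import RandomForestClassifier

def train_forest(labels, features, **rf_params):
    # Fit one random-forest variant on data that is already in memory
    rf = RandomForestClassifier(**rf_params)
    rf.fit(features, labels)
    return rf

Then, in the interpreter, the data is loaded once and reused for every variant:

from model_code import train_forest

train_ids, train_vals = load_train_data("train_matrix.tsv")   # slow step, done once
model_a = train_forest(train_ids, train_vals, n_estimators=100)
model_b = train_forest(train_ids, train_vals, n_estimators=500, max_depth=10)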
Simplest way would be to cache the results, like so:
_train_data_cache = {}

def load_cached_train_data(train_file):
    if train_file not in _train_data_cache:
        _train_data_cache[train_file] = load_train_data(train_file)
    return _train_data_cache[train_file]
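For example (hypothetical file name), the second call returns immediately because the parsed data is served from the in-memory cache:

ids, vals = load_cached_train_data("train_matrix.tsv")   # parses the file (~45 s)
ids, vals = load_cached_train_data("train_matrix.tsv")   # instant: cached result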
Try learning about Python data serialization. You would basically be storing the large file as a Python-specific serialized binary object, for example using Python's marshal module. This would drastically speed up I/O on the file; see these benchmarks for the performance differences. However, if these random forest models are all trained at the same time, then you could just train them against the dataset you already have in memory and release the training data after completion.
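As a sketch of that idea, using pickle rather than marshal (marshal only supports Python's built-in types, not numpy arrays), with hypothetical file names and load_train_data taken from the question:

import pickle

# One-time step: parse the slow text file and cache the result in binary form
train_ids, train_vals = load_train_data("train_matrix.tsv")
with open("train_data.pkl", "wb") as f:
    pickle.dump((train_ids, train_vals), f)

# Every later run: reload the already-parsed data in a fraction of the time
with open("train_data.pkl", "rb") as f:
    train_ids, train_vals = pickle.load(f)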
Load your data in IPython:
my_data = open("data.txt")
Write your code in a Python script, say example.py, which uses this data. At the top of example.py, add these lines:
import sys
args = sys.argv
data = args[1]
...
Now run the python script in ipython:
%run example.py $my_data
Now, when running your python script, you don't need to load data multiple times.