Sklearn joblib load function IO error from AWS S3 - python

I am trying to load a pkl dump of my classifier from scikit-learn.
The joblib dump does a much better compression than the cPickle dump for my object so I would like to stick with it. However, I am getting an error when trying to read the object from AWS S3.
Cases:
Pkl object hosted locally: pickle.load works, joblib.load works
Pkl object pushed to Heroku with app (load from static folder): pickle.load works, joblib.load works
Pkl object pushed to S3: pickle.load works, joblib.load returns IOError. (testing from heroku app and tested from local script)
Note that the pkl objects for joblib and pickle are different objects dumped with their respective methods (i.e. joblib loads only the joblib.dump(obj) file and pickle loads only the cPickle.dump(obj) file).
Joblib vs cPickle code
# case 2, this works for joblib, object pushed to heroku
resources_dir = os.getcwd() + "/static/res/" # main resource directory
input = joblib.load(resources_dir + 'classifier.pkl')
# case 3, this does not work for joblib, object hosted on s3
aws_app_assets = "https://%s.s3.amazonaws.com/static/res/" % keys.AWS_BUCKET_NAME
classifier_url_s3 = aws_app_assets + 'classifier.pkl'
# does not work with raw url, IO Error
classifier = joblib.load(classifier_url_s3)
# urllib2, can't open instance
# TypeError: coercing to Unicode: need string or buffer, instance found
req = urllib2.Request(url=classifier_url_s3)
f = urllib2.urlopen(req)
classifier = joblib.load(urllib2.urlopen(classifier_url_s3))
# but works with a cPickle object hosted on S3
classifier = cPickle.load(urllib2.urlopen(classifier_url_s3))
My app works fine in case 2, but because of very slow loading, I wanted to try and push all static files out to S3, particularly these pickle dumps. Is there something inherently different about the way joblib loads vs pickle that would cause this error?
This is my error
File "/usr/local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 409, in load
with open(filename, 'rb') as file_handle:
IOError: [Errno 2] No such file or directory: classifier url on s3
[Finished in 0.3s with exit code 1]
It is not a permissions issue, as I've made all my objects on S3 public for testing and the pickle.dump objects load fine. The joblib.dump object also downloads if I directly enter the URL into the browser.
I could be completely missing something.
Thanks.

joblib.load() expects the name of a file present on the filesystem.
Signature: joblib.load(filename, mmap_mode=None)
Parameters
-----------
filename: string
The name of the file from which to load the object
Moreover, making all your resources public might not be a good idea for other assets, even if you don't mind the pickled model being accessible to the world.
It is rather simple to copy the object from S3 to the local filesystem of your worker first:
from boto.s3.connection import S3Connection
from sklearn.externals import joblib
import os

s3_connection = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
s3_bucket = s3_connection.get_bucket(keys.AWS_BUCKET_NAME)

# The S3 key is the object's path within the bucket, not its full https URL
local_file = '/tmp/classifier.pkl'
s3_bucket.get_key('static/res/classifier.pkl').get_contents_to_filename(local_file)

clf = joblib.load(local_file)
os.remove(local_file)
Hope this helped.
P.S. You can use this approach to pickle the entire sklearn pipeline, including feature imputation. Just beware of version conflicts of the libraries between training and predicting.
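If you are on boto3 rather than the older boto, a roughly equivalent sketch (assuming the same bucket name and an object key of 'static/res/classifier.pkl') would be:
import os
import boto3
from sklearn.externals import joblib

# Download the pickled classifier to a temporary local file, then load it
s3 = boto3.client('s3',
                  aws_access_key_id=AWS_ACCESS_KEY_ID,
                  aws_secret_access_key=AWS_SECRET_ACCESS_KEY)
local_file = '/tmp/classifier.pkl'
s3.download_file(keys.AWS_BUCKET_NAME, 'static/res/classifier.pkl', local_file)

clf = joblib.load(local_file)
os.remove(local_file)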

Related

Problem loading ML model saved using joblib/pickle

From a Jupyter notebook (.ipynb), I saved my model to .pickle format using joblib.
My ML model is built using pandas, numpy and the statsmodels python library.
I saved the fitted model to a variable called fitted_model and here is how I used joblib:
from sklearn.externals import joblib
# Save RL_Model to file in the current working directory
joblib_file = "joblib_RL_Model.pkl"
joblib.dump(fitted_model, joblib_file)
I get this as output:
['joblib_RL_Model.pkl']
But when I try to load from file, in a new jupyter notebook, using:
# Load from file
joblib_file = "joblib_RL_Model.pkl"
joblib_LR_model = joblib.load(joblib_file)
joblib_LR_model
I only get this back:
<statsmodels.tsa.holtwinters.HoltWintersResultsWrapper at 0xa1a8a0ba8>
and no model. I was expecting to see the model load there and to see the graph outputs as in the original notebook.
Use with open; it is better because it automatically opens and closes the file, with the proper mode.
import pickle

with open('joblib_RL_Model.pkl', 'wb') as f:
    pickle.dump(fitted_model, f)
with open('joblib_RL_Model.pkl', 'rb') as f:
    joblib_LR_model = pickle.load(f)
And my implementation in Colab is here. Check it.
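If you prefer to stay with joblib as in the question, the same pattern should work, since reasonably recent joblib versions also accept open file objects in dump and load (a sketch under that assumption):
from sklearn.externals import joblib

with open('joblib_RL_Model.pkl', 'wb') as f:
    joblib.dump(fitted_model, f)

with open('joblib_RL_Model.pkl', 'rb') as f:
    joblib_LR_model = joblib.load(f)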
You can also use pickle, Python's default serialization package, to save models.
You can use the following function to save ML models:
import pickle

def save_model(model):
    pickle.dump(model, open("model.pkl", "wb"))
A template for the function would be:
import pickle

def save_model(model):
    pickle.dump(model, open(PATH_AND_FILE_NAME_TO_BE_SAVED, "wb"))
To load a model saved with the pickle library, you can use the following function:
def load_model(path):
    return pickle.load(open(path, 'rb'))
where path is the path and filename the model was saved to.
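As a quick usage sketch (the tiny LogisticRegression model and the hard-coded "model.pkl" path are only placeholders, not part of the original answer):
from sklearn.linear_model import LogisticRegression

# Fit a small placeholder model, save it, then reload it with the helpers above
model = LogisticRegression()
model.fit([[0], [1], [2], [3]], [0, 0, 1, 1])
save_model(model)

loaded = load_model("model.pkl")
print(loaded.predict([[1.5]]))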
Note:
This only works for basic ML models and PyTorch models; it does not work for TensorFlow-based models, where you need to use
model.save(PATH_TO_MODEL_AND_NAME)
where model is an instance of tensorflow.keras.Model.

Serialize Python objects without file system

I want to serialize a trained scikit-learn pipeline object so I can reload it for predictions. From what I have seen, pickle and joblib dump are the two common methods for that, with joblib being the preferred approach.
In my case I want to store the serialized python object in the database and load it from there, deserialize it and use it for predictions. Is it possible to serialize the object without any file system access?
Yes, for the pickle library you can get the serialized version of an object by using pickle.dumps instead of pickle.dump.
serialized_object = pickle.dumps(object)
This returns a bytes object, which you should then be able to store in your database, either converting it to base64 first or storing the raw bytes directly.
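A minimal round-trip sketch of that idea (the dict stands in for your fitted pipeline; the database calls are left out):
import base64
import pickle

trained_pipeline = {"example": "object"}  # stand-in for your fitted pipeline

# Serialize to bytes, then base64-encode for text-friendly storage (e.g. a TEXT column)
blob = base64.b64encode(pickle.dumps(trained_pipeline))

# ... store `blob` in the database, fetch it back later ...

# Decode and deserialize for predictions
restored = pickle.loads(base64.b64decode(blob))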
You can do this:
import joblib
from io import BytesIO
import base64
with BytesIO() as tmp_bytes:
    joblib.dump({"test": "test"}, tmp_bytes)
    bytes_obj = tmp_bytes.getvalue()
base64_obj = base64.b64encode(bytes_obj)
Then bytes_obj is a bytes object and base64_obj is its base64-encoded version. Use whichever suits your storage.
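To deserialize later, a sketch assuming you stored one of the two forms above:
from io import BytesIO
import base64
import joblib

# If you stored the raw bytes:
obj = joblib.load(BytesIO(bytes_obj))

# If you stored the base64 string instead, decode it first:
obj = joblib.load(BytesIO(base64.b64decode(base64_obj)))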

How do I read a gzipped parquet file from S3 into Python using Boto3?

I have a file called data.parquet.gzip in my S3 bucket. I can't figure out what the problem is in reading it. Normally I've worked with StringIO, but I don't know how to fix this. I want to import it from S3 into my Python Jupyter notebook session using pandas and boto3.
The solution is actually quite straightforward.
import boto3 # For read+push to S3 bucket
import pandas as pd # Reading parquets
from io import BytesIO # For wrapping raw bytes in a file-like object
import pyarrow # Fast reading of parquets
# Set up your S3 client
# Ideally your Access Key and Secret Access Key are stored in a file already
# So you don't have to specify these parameters explicitly.
s3 = boto3.client('s3',
                  aws_access_key_id=ACCESS_KEY_HERE,
                  aws_secret_access_key=SECRET_ACCESS_KEY_HERE)
# Fetch the object from S3
s3_response_object = s3.get_object(Bucket=BUCKET_NAME_HERE, Key=KEY_TO_GZIPPED_PARQUET_HERE)
# Read your file, i.e. convert it from a stream to bytes using .read()
df = s3_response_object['Body'].read()
# Read your file using BytesIO
df = pd.read_parquet(BytesIO(df))
If you are using an IDE on your laptop/PC to connect to AWS S3, you may refer to Corey's first solution:
import boto3
import pandas as pd
import io
s3 = boto3.resource(service_name='s3', region_name='XXXX',
                    aws_access_key_id='YYYY', aws_secret_access_key='ZZZZ')
buffer = io.BytesIO()
object = s3.Object(bucket_name='bucket_name', key='path/to/your/file.parquet')
object.download_fileobj(buffer)
df = pd.read_parquet(buffer)
If you are using a Glue job, you may refer to Corey's second solution in the Glue script:
df = pd.read_parquet(path='s3://bucket_name/path/to/your/file.parquet')
In case you want to read a .json file (using an IDE in your laptop/PC):
object = s3.Object(bucket_name='bucket_name',
                   key='path/to/your/file.json').get()['Body'].read().decode('utf-8')
df = pd.read_json(object, lines=True)

How to load pickled dataframes from GCS into App Engine

I'm trying to load a pickled pandas dataframe from Google Cloud Storage into App Engine.
I have been using blob.download_to_file() to read the bytestream into pandas; however, I encounter the following error:
UnpicklingError: invalid load key, m
I have tried seeking to the beginning to no avail and am pretty sure something fundamental is missing from my understanding.
When attempting to pass an open file object and read from there, I get an
UnsupportedOperation: write
error
from io import BytesIO
from google.cloud import storage
from google.oauth2 import service_account  # needed for the credentials used below

def get_byte_fileobj(project, bucket, path) -> BytesIO:
    blob = _get_blob(bucket, path, project)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream

def _get_blob(bucket_name, path, project):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob
fileobj = get_byte_fileobj(projectid, 'backups', 'Matches/Matches.pickle')
pd.read_pickle(fileobj)
Ideally pandas would just read from pickle since all of my GCS backups are in that format, but I'm open to suggestions.
The pandas.read_pickle() method takes as argument a file path string, not a file handler/object:
pandas.read_pickle(path, compression='infer')
Load pickled pandas object (or any object) from file.
path : str
File path where the pickled object will be loaded.
If you're in the 2nd generation standard or the flexible environment you could try to use a real /tmp file instead of BytesIO.
Otherwise you'd have to figure out another method of loading the data into pandas that supports a file object/descriptor. The general approach is described in How to restore Tensorflow model from Google bucket without writing to filesystem? (the context is different, but the general idea is the same).
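For the /tmp route, a minimal sketch reusing _get_blob from the question (the bucket and path are the question's placeholders):
import pandas as pd

# Download the pickled dataframe to a real temporary file, then read it by path
blob = _get_blob('backups', 'Matches/Matches.pickle', projectid)
local_path = '/tmp/Matches.pickle'
blob.download_to_filename(local_path)

df = pd.read_pickle(local_path)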

Writing to s3 using imageio and boto3

I want to write images to AWS S3. As a video plays, I am trying to process images through some functions, and when that is done I wish to store them to a specific path. imageio directly checks the extension in the name and writes the image in the appropriate file format.
s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket)
obj = bucket.Object(filepath+'/'+second+'.jpg')
img.imwrite(obj)
If I were to write this to a local location and then upload it to S3 it works, but is there a better way where I could store it to S3 without having to write it locally?
Any help is appreciated.
You can use something like BytesIO from Python's io package to create the file object in memory, and pass that to boto3, like this:
from io import BytesIO

s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket)

# Write the processed image (here called `image`) into an in-memory buffer;
# with no file name, imageio needs an explicit format hint.
in_memory_file = BytesIO()
img.imwrite(in_memory_file, image, format='jpg')

# Rewind the buffer before uploading, otherwise an empty object is sent.
in_memory_file.seek(0)
obj = bucket.Object(filepath+'/'+second+'.jpg')
obj.upload_fileobj(in_memory_file)
This should solve the problem, without having to write the file to disk.
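An alternative sketch under the same assumptions is to hand boto3 the raw bytes with put instead of upload_fileobj:
obj = bucket.Object(filepath+'/'+second+'.jpg')
obj.put(Body=in_memory_file.getvalue())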
