How to load pickled dataframes from GCS into App Engine - python

I'm trying to load a pickled pandas dataframe from Google Cloud Storage into App Engine.
I have been using blob.download_to_file() to read the byte stream into pandas; however, I encounter the following error:
UnpicklingError: invalid load key, m
I have tried seeking to the beginning to no avail and am pretty sure something fundamental is missing from my understanding.
When attempting to pass an open file object and read from there, I get this error instead:
UnsupportedOperation: write
from io import BytesIO
from google.cloud import storage
from google.oauth2 import service_account

def get_byte_fileobj(project, bucket, path) -> BytesIO:
    blob = _get_blob(bucket, path, project)
    byte_stream = BytesIO()
    blob.download_to_file(byte_stream)
    byte_stream.seek(0)
    return byte_stream

def _get_blob(bucket_name, path, project):
    credentials = service_account.Credentials.from_service_account_file(
        service_account_credentials_path) if service_account_credentials_path else None
    storage_client = storage.Client(project=project, credentials=credentials)
    bucket = storage_client.get_bucket(bucket_name)
    blob = bucket.blob(path)
    return blob
fileobj = get_byte_fileobj(projectid, 'backups', 'Matches/Matches.pickle')
pd.read_pickle(fileobj)
Ideally pandas would just read from pickle since all of my GCS backups are in that format, but I'm open to suggestions.

The pandas.read_pickle() method takes a file path string as its argument, not a file handle/object:
pandas.read_pickle(path, compression='infer')
Load pickled pandas object (or any object) from file.
path : str
File path where the pickled object will be loaded.
If you're on the 2nd generation standard environment or the flexible environment, you could try to use a real /tmp file instead of BytesIO.
Otherwise you'd have to figure out another method of loading the data into pandas that supports a file object/descriptor. In general the approach is described in How to restore Tensorflow model from Google bucket without writing to filesystem? (the context is different, but the general idea is the same)
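If you go the /tmp route, something along these lines should work (a minimal sketch assuming a 2nd generation standard or flexible environment; the helper name is illustrative and projectid is the variable from the question):
import pandas as pd
from google.cloud import storage

def read_pickle_from_gcs(project, bucket_name, path):
    client = storage.Client(project=project)
    blob = client.get_bucket(bucket_name).blob(path)
    tmp_path = '/tmp/' + path.split('/')[-1]   # /tmp is writable in these environments
    blob.download_to_filename(tmp_path)        # write the object to a real file
    return pd.read_pickle(tmp_path)            # read_pickle gets a plain file path

df = read_pickle_from_gcs(projectid, 'backups', 'Matches/Matches.pickle')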

Related

What is the correct process for loading an excel workbook from blob storage?

I have an excel file stored in an Azure blob storage container. I need this file in order to generate another excel file based on it. The issue I keep running into is that when using openpyxl I use the following line of code to load the workbook:
wb = load_workbook(filename=)
I am not sure what to pass as filename=. I thought it might need the URL of the excel blob inside the container.
That URL looks something like this: 'https://mystorage.blob.core.cloudapi.net/excel-files/myexcelfile.xlsx'
When I run that code inside my Azure notebook it throws this error:
FileNotFoundError: [Errno 2]: No such file or directory: 'https://mystorage.blob.core.cloudapi.net/excel-files/myexcelfile.xlsx'
Other solutions I read online said to use local memory, but this option will not work. I need to be able to do everything within the Azure ecosystem. If anyone knows how to load a workbook using openpyxl, or another way, from a file that exists inside an Azure storage container, I could use your assistance.
I am able to access the excel files and load them as pandas DataFrames by using the connection string, container name, and blob name, then connecting through the container_client and downloading like so:
conn_str = "abc123"
container = "a_container"
xl_blob = "a_xl"
# container_client is created from conn_str and container via ContainerClient.from_connection_string
download_blob = container_client.download_blob(xl_blob)
df = pd.read_excel(download_blob.readall(), index=1)
Through this I can see the excel file as a pandas df, but loading it through the workbook is tricky.
When I use that same download_blob variable in place of filename=, it throws this error:
TypeError: expected str, bytes or os.PathLike object, not StorageStreamDownloader
Thanks
download_blob is of type StorageStreamDownloader, so passing that into load_workbook is not going to work. Even passing download_blob.readall(), which is of type bytes, into load_workbook is not going to work. You need to write the bytes into an io.BytesIO, which is a file-like object, and pass that into load_workbook.
An io.BytesIO object is like a file that exists only in memory.
Something like this should work:
import io
import openpyxl
from azure.storage.blob import ContainerClient

conn_str = "abc123"
container = "a_container"
xl_blob = "a_xl"

container_client = ContainerClient.from_connection_string(
    conn_str=conn_str, container_name=container
)
download_blob = container_client.download_blob(xl_blob)
# wrap the downloaded bytes in an in-memory, file-like buffer
file = io.BytesIO(download_blob.readall())
wb = openpyxl.load_workbook(file)
ws = wb.active
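If you then need to generate another excel file and write it back to a container (as the question describes), the same kind of in-memory buffer works in the other direction. A rough sketch, where output_container_client and the output blob name are assumptions, and wb is the workbook from above:
out_buffer = io.BytesIO()
wb.save(out_buffer)      # openpyxl can save to a file-like object
out_buffer.seek(0)       # rewind before uploading
output_container_client.upload_blob(name="myresult.xlsx", data=out_buffer, overwrite=True)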

Reading netCDF file from google cloud bucket directly into python

I have a bucket on Google Cloud that contains multiple netCDF files. Normally, when the files are stored locally, I would perform:
import netCDF4
nc = netCDF4.Dataset('path/to/netcdf.nc')
Is it possible to do this in Python straight from Google Cloud without having to first download the file from the bucket?
This function works for loading NetCDF files from a Google Cloud storage bucket:
import xarray as xr
import fsspec

def load_dataset(filename, engine="h5netcdf", *args, **kwargs) -> xr.Dataset:
    """Load a NetCDF dataset from local file system or cloud bucket."""
    with fsspec.open(filename, mode="rb") as file:
        dataset = xr.load_dataset(file, engine=engine, *args, **kwargs)
    return dataset

dataset = load_dataset("gs://bucket-name/path/to/file.nc")
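Note that opening gs:// paths with fsspec relies on the gcsfs package being installed. Credentials, if needed, can be passed through to the filesystem; a small sketch, where the token value is just a placeholder:
# "token" may be a service account JSON file, "cloud" for default environment
# credentials, or "anon" for public buckets (passed through fsspec to gcsfs)
with fsspec.open("gs://bucket-name/path/to/file.nc", mode="rb", token="key.json") as file:
    dataset = xr.load_dataset(file, engine="h5netcdf")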
I'm not sure how to work with Google object store, but here's how you can open a netCDF file from an in-memory buffer containing all the bytes from the file:
from netCDF4 import Dataset
fobj = open('path/to/netcdf.nc', 'rb')
data = fobj.read()
nc = Dataset('memory', memory=data)
So the path forward would be to read all the data from object store, then use that command to read it. That will have some drawbacks for large netcdf files because you're putting all those bytes in your system memory.
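Putting the two pieces together for a Google Cloud Storage bucket, something like this should work (a sketch assuming the google-cloud-storage client; the bucket and object names are placeholders):
from google.cloud import storage
from netCDF4 import Dataset

client = storage.Client()
blob = client.bucket('bucket-name').blob('path/to/netcdf.nc')
# download_as_bytes() is download_as_string() on older client library versions
data = blob.download_as_bytes()
nc = Dataset('memory', memory=data)   # open the netCDF file from in-memory bytes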

Reading excel files from "input" blob storage container and exporting to csv in "output" container with python

I'm trying to develop a script in python to read a file in .xlsx from a blob storage container called "source", convert it to .csv and store it in a new container (I'm testing the script locally; if it works I should include it in an ADF pipeline). So far, I managed to access the blob storage, but I'm having problems reading the file content.
from azure.storage.blob import BlobServiceClient, ContainerClient, BlobClient
import pandas as pd
conn_str = "DefaultEndpointsProtocol=https;AccountName=XXXXXX;AccountKey=XXXXXX;EndpointSuffix=core.windows.net"
container = "source"
blob_name = "prova.xlsx"
container_client = ContainerClient.from_connection_string(
    conn_str=conn_str,
    container_name=container
)
# Download blob as StorageStreamDownloader object (stored in memory)
downloaded_blob = container_client.download_blob(blob_name)
df = pd.read_excel(downloaded_blob)
print(df)
I get following error:
ValueError: Invalid file path or buffer object type: <class 'azure.storage.blob._download.StorageStreamDownloader'>
I tried with a .csv file as input and writing the parsing code as follows:
df = pd.read_csv(StringIO(downloaded_blob.content_as_text()) )
and it works.
Any suggestion on how to modify the code so that the excel file becomes readable?
I'll summarize the solution below.
When we use the method pd.read_excel() from the pandas library, we need to provide bytes as input. But when we use download_blob to download the excel file from Azure blob storage, we just get an azure.storage.blob.StorageStreamDownloader. So we need to use the method readall() or content_as_bytes() to convert it to bytes. For more details, please refer to the pandas documentation and the Azure SDK documentation.
Change
df = pd.read_excel(downloaded_blob)
to
df = pd.read_excel(downloaded_blob.content_as_bytes())
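End to end, the read-convert-upload flow might look roughly like this (a sketch based on the question's setup; the "output" container and the output blob name are assumptions):
from io import BytesIO
from azure.storage.blob import ContainerClient
import pandas as pd

conn_str = "DefaultEndpointsProtocol=https;AccountName=XXXXXX;AccountKey=XXXXXX;EndpointSuffix=core.windows.net"
source_client = ContainerClient.from_connection_string(conn_str=conn_str, container_name="source")
output_client = ContainerClient.from_connection_string(conn_str=conn_str, container_name="output")

# read the xlsx from the source container into a dataframe
df = pd.read_excel(BytesIO(source_client.download_blob("prova.xlsx").readall()))

# convert to csv and upload to the output container
output_client.upload_blob(name="prova.csv", data=df.to_csv(index=False), overwrite=True)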

Writing to s3 using imageio and boto3

I want to write images to AWS S3. As a video plays, I am trying to process images through some functions, and when that's done I wish to store them at a specific path. imageio checks the extension in the file name and writes the image in the appropriate format.
import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket)
obj = bucket.Object(filepath+'/'+second+'.jpg')
img.imwrite(obj)
If I were to write this to a local location and then write it to s3 then it works but is there a better way where I could store it to s3 without having to write it locally.
Any help is appreciated.
You can use something like BytesIO from Python's io package to create the file object in memory, and pass that to the boto3 client, like this:
import boto3
from io import BytesIO

s3 = boto3.resource('s3')
bucket = s3.Bucket(bucket)
in_memory_file = BytesIO()
# write the image into the buffer; with imageio you may need to pass the
# format explicitly (e.g. format='jpg') since there is no filename extension
img.imwrite(in_memory_file)
in_memory_file.seek(0)  # rewind so upload_fileobj reads from the beginning
obj = bucket.Object(filepath + '/' + second + '.jpg')
obj.upload_fileobj(in_memory_file)
This should solve the problem, without having to write the file to disk.
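If the object will later be served directly from S3, it can also help to set the content type on upload; upload_fileobj accepts this via ExtraArgs (an optional tweak, not required for the upload itself):
obj.upload_fileobj(in_memory_file, ExtraArgs={'ContentType': 'image/jpeg'})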

Sklearn joblib load function IO error from AWS S3

I am trying to load a pkl dump of my classifier from scikit-learn.
The joblib dump does a much better compression than the cPickle dump for my object so I would like to stick with it. However, I am getting an error when trying to read the object from AWS S3.
Cases:
Pkl object hosted locally: pickle.load works, joblib.load works
Pkl object pushed to Heroku with app (load from static folder): pickle.load works, joblib.load works
Pkl object pushed to S3: pickle.load works, joblib.load raises an IOError. (tested from the Heroku app and from a local script)
Note that the pkl objects for joblib and pickle are different objects dumped with their respective methods (i.e. joblib loads only objects dumped with joblib.dump(obj), and pickle loads only objects dumped with cPickle.dump(obj)).
Joblib vs cPickle code
# case 2, this works for joblib, object pushed to heroku
resources_dir = os.getcwd() + "/static/res/" # main resource directory
input = joblib.load(resources_dir + 'classifier.pkl')
# case 3, this does not work for joblib, object hosted on s3
aws_app_assets = "https://%s.s3.amazonaws.com/static/res/" % keys.AWS_BUCKET_NAME
classifier_url_s3 = aws_app_assets + 'classifier.pkl'
# does not work with raw url, IO Error
classifier = joblib.load(classifier_url_s3)
# urllib2, can't open instance
# TypeError: coercing to Unicode: need string or buffer, instance found
req = urllib2.Request(url=classifier_url_s3)
f = urllib2.urlopen(req)
classifier = joblib.load(urllib2.urlopen(classifier_url_s3))
# but works with a cPickle object hosted on S3
classifier = cPickle.load(urllib2.urlopen(classifier_url_s3))
My app works fine in case 2, but because of very slow loading, I wanted to try and push all static files out to S3, particularly these pickle dumps. Is there something inherently different about the way joblib loads vs pickle that would cause this error?
This is my error
File "/usr/local/lib/python2.7/site-packages/sklearn/externals/joblib/numpy_pickle.py", line 409, in load
with open(filename, 'rb') as file_handle:
IOError: [Errno 2] No such file or directory: classifier url on s3
[Finished in 0.3s with exit code 1]
It is not a permissions issue as I've made all my objects on s3 public for testing and the pickle.dump objects load fine. The joblib.dump object also downloads if I directly enter the url into the browser
I could be completely missing something.
Thanks.
joblib.load() expects the name of a file present on the filesystem.
Signature: joblib.load(filename, mmap_mode=None)
Parameters
-----------
filename: string
The name of the file from which to load the object
Moreover, making all your resources public might not be a good idea for other assets, even if you don't mind the pickled model being accessible to the world.
It is rather simple to copy the object from S3 to the local filesystem of your worker first:
from boto.s3.connection import S3Connection
from sklearn.externals import joblib
import os
s3_connection = S3Connection(AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
s3_bucket = s3_connection.get_bucket(keys.AWS_BUCKET_NAME)
local_file = '/tmp/classifier.pkl'
# note: get_key() expects the key path within the bucket
# (e.g. 'static/res/classifier.pkl'), not a full https:// URL
s3_bucket.get_key(aws_app_assets + 'classifier.pkl').get_contents_to_filename(local_file)
clf = joblib.load(local_file)
os.remove(local_file)
Hope this helped.
P.S. You can use this approach to pickle the entire sklearn pipeline; this also covers feature imputation. Just beware of library version conflicts between training and predicting.
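For completeness, the same download-then-load pattern with the newer boto3 client would look roughly like this (a sketch; the bucket name and key are placeholders, and recent joblib is imported directly rather than from sklearn.externals):
import os
import boto3
import joblib

s3 = boto3.client('s3')
local_file = '/tmp/classifier.pkl'
s3.download_file('my-bucket', 'static/res/classifier.pkl', local_file)  # bucket name + key path
clf = joblib.load(local_file)
os.remove(local_file)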
