google ml-engine cloud storage as a file - python

I am working in Python with Google Cloud ML Engine. The documentation I have found indicates that data storage should be done with buckets and blobs:
https://cloud.google.com/ml-engine/docs/tensorflow/working-with-cloud-storage
However, much of my code, and the libraries it calls, works with files. Can I somehow treat Google Cloud Storage as a file system in my ML Engine code?
I want my code to read like
with open(<something>) as f:
    for line in f:
        dosomething(line)
Note that in ML Engine one does not create and configure VM instances, so I cannot mount my own shared filesystem with Filestore.

The only way to have Cloud Storage appear as a filesystem is to mount a bucket with Cloud Storage FUSE:
You can use the Google Cloud Storage FUSE tool to mount a Cloud Storage bucket to your Compute Engine instance. The mounted bucket behaves similarly to a persistent disk even though Cloud Storage buckets are object storage.
But you cannot do that if you can't create and configure VMs.
Note that in ml-engine one does not create and configure VM instances.
That's not entirely true. I see ML Engine supports building custom containers, which is typically how one installs and configures OS-level dependencies. But that is only available for training, so if your needs are in that area it may be worth a try.
I assume you have already checked that the library doesn't support access through an already open file-like handle (if not, then How to restore Tensorflow model from Google bucket without writing to filesystem? may be of interest).
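As an illustration of that file-like-handle route (a minimal sketch, assuming a recent google-cloud-storage client library is available in the runtime; the bucket and object names are placeholders and dosomething is the function from the question):
import io
from google.cloud import storage

client = storage.Client()
blob = client.bucket("my-bucket").blob("path/to/data.txt")  # placeholder names

# Blob.open() returns a file-like object you can iterate over line by line
# (available in google-cloud-storage >= 1.38)
with blob.open("r") as f:
    for line in f:
        dosomething(line)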

For those who come after, here is the answer:
Google Cloud ML and GCS Bucket issues
from tensorflow.python.lib.io import file_io
Here is an example
with file_io.FileIO("gs://bucket_name/foobar.txt", "w") as f:
    f.write("FOO")
    f.flush()
    print("Write foobar.txt")

with file_io.FileIO("gs://bucket_name/foobar.txt", "r") as f:
    for line in f:
        print("Read foobar.txt: " + line)

Related

Can I save files in a serverless server?

For instance, if I do something like
f = open("demofile.txt", "w")
f.write("test content")
f.close()
in a serverless environment like Google Cloud Run or Anthos (assume this is part of a web app), will demofile.txt exist permanently, and will I always be able to access it through f.read()?
Your question is strange but I will try to answer it.
You can write a file in serverless products such as Cloud Run, Cloud Functions and App Engine: the /tmp directory is writable. BUT it's an in-memory file system. That means you can write and access your data from your instance (and not from other instances), and the file persists only until the end of the instance. In addition, the file takes space in the allocated memory, so the storage limit is the memory size of your instance.
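A minimal sketch of what that looks like inside a function or request handler (the handler name and file name are just examples); /tmp is the only writable path and the file disappears with the instance:
import os

def handle_request(request):
    path = "/tmp/demofile.txt"   # /tmp is backed by the instance's memory
    with open(path, "w") as f:
        f.write("test content")

    # Readable only on this instance, and only until the instance is recycled
    with open(path) as f:
        content = f.read()

    os.remove(path)              # free the memory the file was using
    return content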
With Cloud Run, there is a new preview feature (released publicly only a few days ago) that allows you to use the 2nd gen runtime and a network file system (Google Cloud Storage with GCSFuse, or Filestore). It's external storage, but viewed as a local directory by your app.
A last point on Anthos (because you mentioned it in your question): Anthos is a suite of products that lets you manage, from the Google Cloud console, resources (mainly Kubernetes clusters) running outside Google Cloud. There is a version of Cloud Run for Anthos, but Anthos isn't a serverless product itself.

How to load a dataset of pictures from Google Cloud Storage to a Jupyter VM Instance

I am new to Google Cloud Storage and VMs. My problem is that I have a big dataset of images stored in a bucket, and I want to do two things:
First, I want to load this dataset into my Jupyter instance to run some deep learning models, but I haven't found any clear method to do so.
Second, I want to save some output images, and also some other files, to a different bucket.
Thank you in advance :)
You can use the Cloud Storage FUSE tool to mount a Cloud Storage bucket to your instance. The mounted bucket behaves similarly to a persistent disk even though Cloud Storage buckets are object storage. Once you have the buckets mounted as disks, you can have the first bucket feed the images into the deep learning model and save the output to the second bucket.
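If you would rather not mount the buckets, a rough sketch with the google-cloud-storage client library does the same two things; the bucket names, prefixes and local paths below are placeholders:
import os
from google.cloud import storage

client = storage.Client()
os.makedirs("/home/jupyter/data", exist_ok=True)

# 1. Download the image dataset from the input bucket to the notebook VM's disk
for blob in client.list_blobs("input-bucket-name", prefix="images/"):
    if blob.name.endswith("/"):   # skip "directory" placeholder objects
        continue
    blob.download_to_filename("/home/jupyter/data/" + os.path.basename(blob.name))

# 2. Upload an output image (or any other file) to a different bucket
out_bucket = client.bucket("output-bucket-name")
out_bucket.blob("results/output_0.png").upload_from_filename("/home/jupyter/output_0.png")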

Allow Google Cloud Compute Engine Instance to write file to Google Storage Bucket - Python

In my Python server script, which is running on a Google Cloud VM instance, it tries to save an image (JPEG) to storage, but it throws the following error:
File "/home/thamindudj_16/server/object_detection/object_detector.py", line 109, in detect_hand
    new_img.save("slicedhand/{}#sliced_image{}.jpeg".format(threadname, i))
File "/home/thamindudj_16/.local/lib/python3.5/site-packages/PIL/Image.py", line 2004, in save
    fp = builtins.open(filename, "w+b")
OSError: [Errno 5] Input/output error: 'slicedhand/thread_1#sliced_image0.jpeg'
All the files, including the Python scripts, are in a Google Storage bucket that has been mounted to the VM instance using gcsfuse. The app tries to save the new image in the slicedhand folder.
Here is the Python code snippet where the image saving happens:
from PIL import Image
...
...
i = 0
new_img = Image.fromarray(bounding_box_img) ## conversion to an image
new_img.save("slicedhand/{}#sliced_image{}.jpeg".format(threadname, i))
I think maybe the problem is with access permissions. The docs say to use --key_file, but what is the key file I should use and where can I find it? I'm not clear whether this is the problem or something else.
Any help would be appreciated.
I understand that you are using gcsfuse on your Linux VM instance to access Google Cloud Storage.
The key file is a service account credentials key that allows you to initialize the Cloud SDK or a client library as another service account. You can download a key file from the Cloud Console. However, if you are using a VM instance, you are automatically using the Compute Engine default service account. You can check it with the console command: $ gcloud init.
To configure your credentials properly, please follow the documentation.
The Compute Engine default service account needs to have the access scope Storage > Full enabled. An access scope is the mechanism that limits the access level to Cloud APIs. It can be set during machine creation or while the VM instance is stopped.
Please note that access scopes are defined explicitly for the service account that you select for the VM instance.
Cloud Storage object names have naming requirements. It is strongly recommended to avoid using the hash symbol "#" in object names.
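As an alternative to writing through the gcsfuse mount, here is a sketch that uploads the image straight to the bucket with the google-cloud-storage client library (the bucket name is a placeholder, bounding_box_img, threadname and i come from the question's code, and the "#" is replaced in the object name):
import io
from PIL import Image
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("your-bucket-name")

new_img = Image.fromarray(bounding_box_img)  # conversion to an image, as in the question

# Serialize the image in memory instead of writing through the FUSE mount
buf = io.BytesIO()
new_img.save(buf, format="JPEG")
buf.seek(0)

# Use an underscore instead of "#" in the object name
blob = bucket.blob("slicedhand/{}_sliced_image{}.jpeg".format(threadname, i))
blob.upload_from_file(buf, content_type="image/jpeg")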

Extracting excel files from the FTP to BigQuery using Cloud Functions

I am working on creating an automated script to download files from an FTP server and store them in BigQuery.
The problem is that BigQuery accepts only .csv files. For this reason I am looking for ways to also handle .xlsx and .xls files, given that I am planning to put this batch code in the cloud.
I am mentioning the latter because one way to convert .xlsx files to .csv is to use something like:
import pandas as pd
data_xls = pd.read_excel('file_on_ftp.xlsx')
data_xls.to_csv('csvfile.csv', encoding='utf-8', index=False)
However, this will create local files somewhere in temporary storage. With Cloud Functions I then have to monitor whether the file has been deleted afterwards, which makes it unreliable if a Cloud Function crashes.
Are there therefore better ways to handle loading .xlsx files into BigQuery? Or is this the way to go?
You might be interested this guide, which was just recently published: "Streaming data from Cloud Storage into BigQuery using Cloud Functions".
The general architecture would be:
upload the files in question from FTP to Cloud Storage
your Cloud Function receives the upload event from Cloud Storage
your Cloud Function loads the file into memory (no storage on disk)
your Cloud Function streams the data to BigQuery
I'm not sure if #1 suits your need to ensure that the file is not left behind somewhere, but I think it's probably the best possible solution if you are operating under the assumption that the file needs to be uploaded to GCP somewhere (the alternative would be streaming to BigQuery directly from your local machine, or an instance that is under your control).
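A rough sketch of steps 2-4, assuming google-cloud-storage, google-cloud-bigquery, pandas and openpyxl are in the function's requirements; the function name and the dataset/table name are placeholders:
import io
import pandas as pd
from google.cloud import bigquery, storage

def gcs_to_bq(event, context):
    """Triggered by a file being finalized in the Cloud Storage bucket."""
    storage_client = storage.Client()
    bq_client = bigquery.Client()

    # Read the uploaded .xlsx entirely into memory (no local file)
    blob = storage_client.bucket(event["bucket"]).blob(event["name"])
    data = pd.read_excel(io.BytesIO(blob.download_as_bytes()))

    # Stream the dataframe into BigQuery without writing an intermediate CSV
    job = bq_client.load_table_from_dataframe(data, "your_dataset.your_table")
    job.result()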
A good method is to extract the file from FTP to GCS through Cloud Functions. As @Dustin mentioned, you can then stream the data into BigQuery from GCS.
Here's a way to extract files from FTP via Cloud Functions:
# import libraries
from google.cloud import storage
import wget

def importFile(request):
    # set storage client
    client = storage.Client()
    # get bucket
    bucket = client.get_bucket('BUCKET-NAME')  # without gs://
    blob = bucket.blob('file-name.csv')
    # see if the file already exists
    if not blob.exists():
        # copy file to google storage
        try:
            link = 'ftp://account:password@ftp.domain.com/folder/file.csv'  # for non-public ftp files
            ftpfile = wget.download(link, out='/tmp/destination-file-name.csv')  # save downloaded file in /tmp folder of Cloud Functions
            blob.upload_from_filename(ftpfile)
            print('Copied file to Google Storage!')
        # print error if the download or upload fails
        except BaseException as error:
            print('An exception occurred: {}'.format(error))
    # print message if file already exists in Google Storage
    else:
        print('File already exists in Google Storage')

ferris2-framework, python, google app engine, cloud storage -- uploading an image and making it public?

So the Google Ferris2 framework seems to exclusively use the blobstore API for the Upload component. This makes me question whether it's possible to make images uploaded to Cloud Storage public without writing my own upload method and abandoning the Upload component altogether, which also seems to create compatibility issues when using the Cloud Storage client library (Python).
Backstory / context
Using: Google App Engine, Python, Cloud Storage client library
Requirements
0.5 We require that neither the blob information nor the file be stored in the model. We want a public Cloud Storage serving URL on the model and that is all. This seems to prevent us from using the normal Ferris approach for uploading to Cloud Storage.
Things I already know / road blocks
One of the big roadblocks is dealing with Ferris using cgi / the blobstore API for field storage on the form. This seems to cause problems because so far it hasn't allowed data to be sent to Cloud Storage through the Google Cloud Storage Python client.
Things we know about the google cloud storage python client and cgi:
To write data to Cloud Storage from our server, Cloud Storage needs to be called with cloudstorage.open("/bucket/object", "w", ...) (a Cloud Storage library method). However, it appears so far that a cgi.FieldStorage is returned from the post for the wtforms.fields.FileField() (as shown by a simple "print image" statement) before the data is applied to the model; after it is applied to the model, it is a blobstore instance.
I would like verification on this:
After a lot of research and testing, it seems that because Ferris is limited to the blobstore API for the Uploads component, using the blobstore API and blob keys to handle uploads is basically unavoidable without creating a second upload function just for the Cloud Storage call. Blob instances seem not to be compatible with the Cloud Storage client library, and it seems there is no way to get anything but metadata from blob files (without actually making a call to Cloud Storage to get the original file). However, it appears that this will not require storing extra data on the server. Furthermore, I believe it may be possible to get around the public-link issue by setting the entire bucket to have read permissions.
Clarifying Questions:
1. To make uploaded images available to the public via our application (any user, not only authenticated users), will I have to use the cloudstorage Python client library, or is there a way to do this with the blobstore API?
2. Is there a way to get the original file from a blob key (on save, with the add action method) without actually having to make a call to Cloud Storage first, so that the file can be uploaded using that library?
3. If not, is there a way to grab the file from the cgi.FieldStorage and then send it to Cloud Storage with the Python client library? It seems that cgi.FieldStorage.value is just metadata and not the file, same with cgi.FieldStorage.file.read().
1) You cannot use the GAE GCS client to update an ACL.
2) You can use the GCS json API after the blobstore upload to GCS and change the ACL to make it public. You do not have to upload again.
See this example code which inserts an acl.
3) Or use cgi.FieldStorage to read the data (< 32 MB) and write it to GCS using the GAE GCS client:
import cloudstorage as gcs
import mimetypes
import webapp2

class UploadHandler(webapp2.RequestHandler):
    def post(self):
        file_data = self.request.get("file", default_value=None)
        filename = self.request.POST["file"].filename
        content_type = mimetypes.guess_type(filename)[0]
        # gcs.open expects a path of the form /bucket_name/object_name
        with gcs.open('/bucket_name/' + filename, 'w',
                      content_type=content_type or b'binary/octet-stream',
                      options={b'x-goog-acl': b'public-read'}) as f:
            f.write(file_data)
A third method: use a form post upload with a GCS signed url and a policy document to control the upload.
And you can always use a public download handler, which reads files from the blobstore or GCS.
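A sketch of such a public download handler in the classic GAE webapp2 style, assuming the blob key is passed in the URL (route, class name and URL pattern are illustrative):
import webapp2
from google.appengine.ext import blobstore
from google.appengine.ext.webapp import blobstore_handlers

class PublicDownloadHandler(blobstore_handlers.BlobstoreDownloadHandler):
    def get(self, blob_key):
        if not blobstore.get(blob_key):
            self.error(404)
        else:
            # Streams the blob to the client; for a GCS object, pass the key
            # obtained from blobstore.create_gs_key('/gs/bucket/object')
            self.send_blob(blob_key)

app = webapp2.WSGIApplication([('/download/([^/]+)?', PublicDownloadHandler)], debug=True)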
You can now specify the ACL when uploading a file from App Engine to Cloud Storage. I'm not sure how long it's been in place, just wanted to share:
filename = '/' + bucket_name + '/Leads_' + newUNID() + '.csv'
write_retry_params = gcs.RetryParams(backoff_factor=1.1)
gcs_file = gcs.open(filename,
                    'w',
                    content_type='text/csv',
                    options={'x-goog-acl': 'public-read'},
                    retry_params=write_retry_params)
docs: https://cloud.google.com/storage/docs/xml-api/reference-headers#standard
