I am working on an automated script to download files from an FTP server and store them in BigQuery.
The problem is that BigQuery accepts only .csv files. For this reason I am looking for ways to also handle .xlsx and .xls files, given that I plan to run this batch code in the Cloud.
I mention the latter because one way to convert .xlsx files to .csv is to use something like:
import pandas as pd
data_xls = pd.read_excel('file_on_ftp.xlsx')
data_xls.to_csv('csvfile.csv', encoding='utf-8', index=False)
However, this creates local files somewhere in temporary storage. With Cloud Functions I would then have to monitor whether the file has been deleted afterwards, which makes the approach unreliable if a Cloud Function crashes.
Are there better ways to handle loading .xlsx files into BigQuery, or is this the way to go?
You might be interested in this guide, which was recently published: "Streaming data from Cloud Storage into BigQuery using Cloud Functions".
The general architecture would be:
upload the files in question from FTP to Cloud Storage
your Cloud Function receives the upload event from Cloud Storage
your Cloud Function loads the file into memory (no storage on disk)
your Cloud Function streams the data to BigQuery
I'm not sure if #1 suits your need to ensure that the file is not left behind somewhere, but I think it's probably the best possible solution if you are operating under the assumption that the file needs to be uploaded to GCP somewhere (the alternative would be streaming to BigQuery directly from your local machine, or an instance that is under your control).
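To illustrate steps 2–4, here is a minimal sketch of such a Cloud Function, assuming the bucket, dataset, and table names are your own (everything named below is a placeholder): it reads the uploaded .xlsx entirely in memory with pandas and loads it into BigQuery without writing anything to disk.

import io

import pandas as pd
from google.cloud import bigquery, storage


def load_xlsx_to_bq(event, context):
    """Triggered by a finalize event on a Cloud Storage bucket (1st-gen background function)."""
    bucket_name = event["bucket"]
    file_name = event["name"]  # e.g. "incoming/file_on_ftp.xlsx"

    # Read the uploaded workbook into memory; no temp files on disk.
    # (pandas needs an Excel engine such as openpyxl installed.)
    blob = storage.Client().bucket(bucket_name).blob(file_name)
    data_xls = pd.read_excel(io.BytesIO(blob.download_as_bytes()))

    # Load the dataframe straight into BigQuery.
    bq_client = bigquery.Client()
    table_id = "my-project.my_dataset.my_table"  # hypothetical destination table
    job = bq_client.load_table_from_dataframe(data_xls, table_id)
    job.result()  # wait for the load job to finish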
A good method is to extract the file from FTP to GCS through Cloud Functions. As @Dustin mentioned, you can then stream the data into BigQuery from GCS.
Here's a way to extract files from FTP via Cloud Functions:
#import libraries
from google.cloud import storage
import wget

def importFile(request):

    #set storage client
    client = storage.Client()

    # get bucket
    bucket = client.get_bucket('BUCKET-NAME') #without gs://
    blob = bucket.blob('file-name.csv')

    #see if file already exists
    if not blob.exists():

        #copy file to google storage
        try:
            link = 'ftp://account:password@ftp.domain.com/folder/file.csv' #for non-public ftp files
            ftpfile = wget.download(link, out='/tmp/destination-file-name.csv') #save downloaded file in /tmp folder of Cloud Functions

            blob.upload_from_filename(ftpfile)
            print('Copied file to Google Storage!')

        #print error if the download or upload fails
        except BaseException as error:
            print('An exception occurred: {}'.format(error))

    #print message if file already exists in Google Storage
    else:
        print('File already exists in Google Storage')
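If you want to avoid touching the /tmp filesystem entirely (the concern raised in the question), a variation is to pull the FTP file into an in-memory buffer with the standard-library ftplib and upload that buffer directly. This is only a sketch; the host, credentials, paths, and bucket name are placeholders.

import io
from ftplib import FTP

from google.cloud import storage


def import_file_in_memory(request):
    # Download the FTP file into an in-memory buffer (no /tmp files).
    buffer = io.BytesIO()
    ftp = FTP('ftp.domain.com')  # placeholder host
    ftp.login(user='account', passwd='password')  # placeholder credentials
    ftp.retrbinary('RETR /folder/file.csv', buffer.write)
    ftp.quit()
    buffer.seek(0)

    # Upload the buffer straight to Cloud Storage.
    bucket = storage.Client().get_bucket('BUCKET-NAME')
    bucket.blob('file-name.csv').upload_from_file(buffer, content_type='text/csv')
    print('Copied file to Google Storage!')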
Is there any way to unload data from Snowflake to CSV format, or can it be stored directly as CSV in Google Cloud Storage?
We are using Composer (Airflow) DAGs to connect to Snowflake, unload data from tables into CSV files, and store them in Google Cloud Storage, to be migrated onwards later.
What I have tried:
Querying data from the Snowflake table and getting it into a variable.
What I want to do further:
Convert the data into a CSV file (I have not run the code yet) and migrate it to a GCS bucket, but it seems Airflow only offers a GCSToGCSOperator, which does not help here.
What I am thinking:
Whether I should use a Python file with a scheduler instead of writing a DAG.
Doing it through Dataflow (Beam) and running it on Composer.
Code:
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

def func(**context):
    dwh_hook = SnowflakeHook(snowflake_conn_id="snowflake_conn")
    result = dwh_hook.get_first("select col1,col2,col3,col4,col5 from table_name where col_name = previous_date_func_here")
    # print(result)
I have not tested it yet because I want to test it together with GCS, but it seems like it is not going to work. What are the possible approaches?
Is it actually possible to do this with Airflow at all?
Snowflake supports data unloading using the COPY INTO <location> command:
Unloads data from a table (or query) into one or more files in one of the following locations:
Named internal stage (or table/user stage). The files can then be downloaded from the stage/location using the GET command.
Named external stage that references an external location (Amazon S3, Google Cloud Storage, or Microsoft Azure).
External location (Amazon S3, Google Cloud Storage, or Microsoft Azure).
Format Type Options (formatTypeOptions)
TYPE = CSV
TYPE = JSON
TYPE = PARQUET
Unloading Data from a Table Directly to Files in an External Location
Google Cloud Storage
Access the referenced GCS bucket using a referenced storage integration named myint:
COPY INTO 'gcs://mybucket/unload/'
FROM mytable
STORAGE_INTEGRATION = myint
FILE_FORMAT = (FORMAT_NAME = my_csv_format);
Related: Configuring an Integration for Google Cloud Storage
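Since the question asks about driving this from Airflow/Composer: one way to wire it up is to run the COPY INTO statement above from a DAG with the SnowflakeOperator, so Snowflake itself writes the CSV files to the GCS stage. This is only a sketch; the DAG id, schedule, and object names are assumptions, and the connection id mirrors the one in the question.

from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

UNLOAD_SQL = """
COPY INTO 'gcs://mybucket/unload/'
FROM mytable
STORAGE_INTEGRATION = myint
FILE_FORMAT = (FORMAT_NAME = my_csv_format);
"""

with DAG(
    dag_id="snowflake_unload_to_gcs",  # hypothetical DAG name
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    unload_to_gcs = SnowflakeOperator(
        task_id="unload_to_gcs",
        snowflake_conn_id="snowflake_conn",  # same connection id as in the question
        sql=UNLOAD_SQL,
    )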
I am new to GCP as well as Python.
I am trying to read CSV files present in Google Cloud Storage and write the data into a Cloud SQL table using Python. Can anyone help with that? Any help will be appreciated.
Thanks in advance
You shouldn't read and load the data yourself if you have no updates or cleaning to perform on it. You can use the Cloud SQL "import CSV from Cloud Storage" capability. It also works for PostgreSQL (switch the database engine at the top of that page).
Do you need a code example for calling a REST API in Python? (It's quite basic these days, but the security part can be annoying!)
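For reference, here is a minimal sketch of triggering that CSV import through the Cloud SQL Admin API with the Google API Python client; the project, instance, database, bucket, and table names are all placeholders.

import googleapiclient.discovery

# Application Default Credentials are picked up automatically.
sqladmin = googleapiclient.discovery.build("sqladmin", "v1beta4")

import_request_body = {
    "importContext": {
        "fileType": "CSV",
        "uri": "gs://my-bucket/my-file.csv",        # placeholder CSV in GCS
        "database": "my_database",                  # placeholder database
        "csvImportOptions": {"table": "my_table"},  # placeholder target table
    }
}

# Kicks off a server-side import: Cloud SQL reads the CSV straight from GCS.
operation = (
    sqladmin.instances()
    .import_(project="my-project", instance="my-instance", body=import_request_body)
    .execute()
)
print(operation["name"])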
I like to use google-cloud-storage when operating with GCS. The package is basically a wrapper for GCloud's API.
Here's how you might use this library:
import os

from google.cloud import storage

# create the client
path_to_service_account = "path/foo.json"
client = storage.Client.from_service_account_json(path_to_service_account)

# get the bucket
bucket_name = "my-bucket"
bucket = client.lookup_bucket(bucket_name)

# loop through the bucket to get the resource you want
for resource in bucket.list_blobs(prefix="dir-name/"):
    # file type doesn't need to be a csv...
    # list_blobs yields Blob objects, so compare against the blob's name
    if resource.name.endswith("my_file.csv"):
        my_blob = bucket.get_blob(resource.name)
        my_blob_name = resource.name.split("/")[-1]
        my_blob.download_to_filename(os.path.join("save-dir", my_blob_name))

# finally, load the file from local storage:
...
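If you'd rather not write to local disk at all, the same library can also hand you the object contents in memory. Here is a small sketch (the bucket and object names are assumptions) that loads the CSV straight into pandas:

import io

import pandas as pd
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket("my-bucket")          # placeholder bucket
blob = bucket.get_blob("dir-name/my_file.csv")   # placeholder object

# download_as_bytes() keeps everything in memory, no temp files needed
df = pd.read_csv(io.BytesIO(blob.download_as_bytes()))
print(df.head())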
I am working in Python with Google Cloud ML-Engine. The documentation I have found indicates that data storage should be done with Buckets and Blobs
https://cloud.google.com/ml-engine/docs/tensorflow/working-with-cloud-storage
However, much of my code, and the libraries it calls, work with files. Can I somehow treat Google Storage as a file system in my ml-engine code?
I want my code to read like
with open(<something>) as f:
    for line in f:
        dosomething(line)
Note that in ml-engine one does not create and configure VM instances. So I can not mount my own shared filesystem with Filestore.
The only way to have Cloud Storage appear as a filesystem is to mount a bucket as a file system:
You can use the Google Cloud Storage FUSE tool to mount a Cloud Storage bucket to your Compute Engine instance. The mounted bucket behaves similarly to a persistent disk even though Cloud Storage buckets are object storage.
But you cannot do that if you can't create and configure VMs.
Note that in ml-engine one does not create and configure VM instances.
That's not entirely true. I see that ML Engine supports building custom containers, which is typically how one installs and configures OS-level dependencies. But that is only for training, so if your needs are in that area it may be worth a try.
I assume you already checked that the library doesn't support access through an already open file-like handler (if not, this may be of interest: How to restore Tensorflow model from Google bucket without writing to filesystem?).
For those that come after, here is the answer
Google Cloud ML and GCS Bucket issues
from tensorflow.python.lib.io import file_io
Here is an example:
with file_io.FileIO("gs://bucket_name/foobar.txt", "w") as f:
    f.write("FOO")
    f.flush()
    print("Write foobar.txt")

with file_io.FileIO("gs://bucket_name/foobar.txt", "r") as f:
    for line in f:
        print("Read foobar.txt: " + line)
For directories on a local machine, the os.walk() function is commonly used for walking a directory tree in Python.
Google has a Python module (google.cloud.storage) for uploading to and downloading from a GCP bucket in a locally-run Python script.
I need a way to walk directory trees in a GCP bucket. I browsed through the classes in the google.cloud Python module, but could not find anything. Is there a way to perform something similar to os.walk() on directories inside a GCP bucket?
No such function exists in the GCS library. However, GCS can list objects by prefix, which is usually sufficiently equivalent:
from google.cloud import storage

bucket = storage.Client().get_bucket(bucket_name)
for blob in bucket.list_blobs(prefix="dir1/"):
    print(blob.name)
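If you want something closer to os.walk()'s split between subdirectories and files, you can approximate it with the delimiter parameter. The helper below is only a sketch of that idea; walk_gcs and the bucket name are my own, not part of the library.

from google.cloud import storage


def walk_gcs(bucket_name, prefix=""):
    """Yield (prefix, subdirectories, blob names), roughly like os.walk()."""
    client = storage.Client()
    iterator = client.list_blobs(bucket_name, prefix=prefix, delimiter="/")
    blobs = [blob.name for blob in iterator]   # consuming the iterator populates .prefixes
    subdirs = sorted(iterator.prefixes)
    yield prefix, subdirs, blobs
    for subdir in subdirs:
        yield from walk_gcs(bucket_name, prefix=subdir)


for path, dirs, files in walk_gcs("my-bucket"):  # placeholder bucket name
    print(path, dirs, files)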
import os
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('bucket_name')

for blob in bucket.list_blobs(prefix=''):
    # Download the file locally
    with open(blob.name, 'wb') as file_obj:
        client.download_blob_to_file(blob, file_obj)

    # Your logic on the file goes here

    # Remove the local file
    os.remove(blob.name)
So the Google Ferris2 framework seems to exclusively use the Blobstore API for the Upload component. This makes me question whether it's possible to make images uploaded to Cloud Storage public without writing my own upload method and abandoning the Upload component altogether, which also seems to create compatibility issues when using the Cloud Storage client library (Python).
Backstory / context
Using: Google App Engine, Python, Cloud Storage client library
Requirements
0.5 We require that neither blob information nor the file be stored in the model. We want a public Cloud Storage serving URL on the model, and that is all. This seems to prevent us from using the normal Ferris approach for uploading to Cloud Storage.
Things I already know / road blocks
One of the big roadblocks is dealing with Ferris using cgi / the Blobstore API for field storage on the form. This seems to cause problems because so far it hasn't allowed data to be sent to Cloud Storage through the Google Cloud Storage Python client.
Things we know about the Google Cloud Storage Python client and cgi:
To write data to Cloud Storage from our server, Cloud Storage needs to be called with cloudstorage.open("/bucket/object", "w", ...) (a Cloud Storage library method). However, it appears that a cgi.FieldStorage is returned from the POST for the wtforms.fields.FileField() (as shown by a simple "print image" statement) before the data is applied to the model; after it is applied to the model, it is a Blobstore instance.
I would like verification on this:
After a lot of research and testing, it seems that because Ferris is limited to the Blobstore API for the Uploads component, using the Blobstore API and blob keys to handle uploads is basically unavoidable without creating a second upload function just for the Cloud Storage call. Blob instances seem incompatible with the Cloud Storage client library, and it seems there is no way to get anything but metadata from blob files (without actually making a call to Cloud Storage to get the original file). However, it appears that this will not require storing extra data on the server. Furthermore, I believe it may be possible to get around the public link issue by setting the entire bucket to have read permissions.
Clarifying Questions:
1. To make uploaded images available to the public via our application (any user, not an authenticated user), will I have to use the cloudstorage Python client library, or is there a way to do this with the Blobstore API?
2. Is there a way to get the original file from a blob key (on save, with the add action method) without having to make a call to Cloud Storage first, so that the file can be uploaded using that library?
3. If not, is there a way to grab the file from the cgi.FieldStorage, then send it to Cloud Storage with the Python client library? It seems that cgi.FieldStorage.value is just metadata and not the file, same with cgi.FieldStorage.file.read().
1) You cannot use the GAE GCS client to update an ACL.
2) You can use the GCS JSON API after the Blobstore upload to GCS and change the ACL to make it public. You do not have to upload again.
See this example code which inserts an ACL (a minimal sketch is also included at the end of this answer).
3) Or use cgi.FieldStorage to read the data (< 32 MB) and write it to GCS using the GAE GCS client:
import mimetypes

import cloudstorage as gcs
import webapp2

class UploadHandler(webapp2.RequestHandler):

    def post(self):
        file_data = self.request.get("file", default_value=None)
        filename = self.request.POST["file"].filename
        content_type = mimetypes.guess_type(filename)[0]

        # gcs.open() expects an object path of the form '/bucket_name/object_name'
        with gcs.open(filename, 'w',
                      content_type=content_type or b'binary/octet-stream',
                      options={b'x-goog-acl': b'public-read'}) as f:
            f.write(file_data)
A third method: use a form post upload with a GCS signed url and a policy document to control the upload.
And you can always use a public download handler, which reads files from the blobstore or GCS.
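As referenced in option 2) above, here is a hedged sketch of inserting a public-read ACL on an existing object through the GCS JSON API using the Google API Python client; the bucket and object names are placeholders.

import googleapiclient.discovery

storage_api = googleapiclient.discovery.build("storage", "v1")

# Grant READER to allUsers on an object that already exists in GCS,
# e.g. one that arrived there via a Blobstore upload.
acl_entry = (
    storage_api.objectAccessControls()
    .insert(
        bucket="my-bucket",               # placeholder bucket
        object="uploads/my-image.png",    # placeholder object name
        body={"entity": "allUsers", "role": "READER"},
    )
    .execute()
)
print(acl_entry)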
You can now specify the ACL when uploading a file from App Engine to Cloud Storage. Not sure how long it's been in place, just wanted to share:
import cloudstorage as gcs

filename = '/' + bucket_name + '/Leads_' + newUNID() + '.csv'
write_retry_params = gcs.RetryParams(backoff_factor=1.1)
gcs_file = gcs.open(filename,
                    'w',
                    content_type='text/csv',
                    options={'x-goog-acl': 'public-read'},
                    retry_params=write_retry_params)
docs: https://cloud.google.com/storage/docs/xml-api/reference-headers#standard