Read data from GCS and load into Cloud SQL - Python

I am new to GCP as well as Python.
I am trying to read CSV files present in Google Cloud Storage and write the data into a Cloud SQL table using Python. Can anyone help with that? Any help will be appreciated.
Thanks in advance.

You shouldn't read and load the data yourself if you have no updates or cleaning to perform on it. You can use Cloud SQL's built-in capability to load a CSV from Cloud Storage. It also works for PostgreSQL (switch the database engine at the top of that page).
Do you need a code example for calling a REST API in Python? (It's quite basic these days, but the authentication can be annoying!)
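For reference, here is a minimal sketch of that REST call: the instances.import method of the Cloud SQL Admin API loads a CSV from GCS directly into a table. The project, instance, bucket, database, and table names below are placeholders, and the access token is assumed to come from somewhere like gcloud auth print-access-token.

```python
# Sketch: trigger a Cloud SQL CSV import from GCS via the Admin API.
# Project, instance, bucket, database, and table names are placeholders.
import json
import urllib.request


def build_import_body(gcs_uri, database, table):
    """Request body for the Cloud SQL Admin API instances.import method."""
    return {
        "importContext": {
            "uri": gcs_uri,
            "database": database,
            "fileType": "CSV",
            "csvImportOptions": {"table": table},
        }
    }


def import_csv(project, instance, access_token, gcs_uri, database, table):
    url = ("https://sqladmin.googleapis.com/v1/projects/"
           f"{project}/instances/{instance}/import")
    req = urllib.request.Request(
        url,
        data=json.dumps(build_import_body(gcs_uri, database, table)).encode(),
        headers={"Authorization": f"Bearer {access_token}",
                 "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # an Operation resource; poll it until done
```

The call is asynchronous: the returned Operation should be polled until it reports DONE before the table is queried.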

I like to use google-cloud-storage when operating with GCS. The package is basically a wrapper for GCloud's API.
Here's how you might use this library:
import os

from google.cloud import storage

# create the client from a service-account key file
path_to_service_account = "path/foo.json"
client = storage.Client.from_service_account_json(path_to_service_account)

# get the bucket
bucket_name = "my-bucket"
bucket = client.lookup_bucket(bucket_name)

# loop through the bucket to find the blob you want
# (list_blobs yields Blob objects, so check blob.name)
for blob in bucket.list_blobs(prefix="dir-name/"):
    # the file type doesn't need to be csv...
    if blob.name.endswith("my_file.csv"):
        blob_name = blob.name.split("/")[-1]
        blob.download_to_filename(os.path.join("save-dir", blob_name))

# finally, load the file from local storage:
...
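To complete the elided last step, here is one possible sketch of loading the downloaded CSV into a Cloud SQL (MySQL) table. It assumes the PyMySQL driver and an instance reachable from this host (for example through the Cloud SQL Auth proxy on 127.0.0.1); the connection details and table name are placeholders.

```python
# Sketch: load a local CSV into a Cloud SQL (MySQL) table.
# Host, credentials, database, and table names are placeholders.
import csv


def build_insert(table, columns):
    """Parameterized INSERT statement suitable for executemany()."""
    cols = ", ".join(columns)
    placeholders = ", ".join(["%s"] * len(columns))
    return f"INSERT INTO {table} ({cols}) VALUES ({placeholders})"


def read_rows(path):
    """Read a CSV, returning (header, rows); assumes a header row."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        return header, list(reader)


def load_csv(path, table):
    import pymysql  # pip install pymysql

    header, rows = read_rows(path)
    conn = pymysql.connect(host="127.0.0.1", user="user",
                           password="secret", database="mydb")
    try:
        with conn.cursor() as cur:
            cur.executemany(build_insert(table, header), rows)
        conn.commit()
    finally:
        conn.close()
```

For anything beyond small files, the built-in Cloud SQL CSV import mentioned in the other answer avoids pulling the data through your machine at all.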

Related

Snowflake|Airflow|Google Cloud : How to unload data from snowflake tables in csv format and upload to gcs using airflow

Is there any way to unload data from Snowflake to CSV format, or can it be stored directly as CSV in Google Cloud Storage?
We are using Composer (Airflow) DAGs to connect to Snowflake, unload data from tables into CSV files, and store them in Google Cloud Storage, migrating it further later on.
What I have tried:
Querying data from a Snowflake table and getting it into a variable.
What I want to do further:
Convert the data into a CSV file (I have not run the code yet) and migrate it to a GCS bucket, but it seems there is only the GCSToGCSOperator in Airflow, which cannot help with this.
What I am thinking:
Whether I should use a Python file with a scheduler instead of writing a DAG.
Doing it through Dataflow (Beam) and running it on Composer.
Code:
def func(**context):
    dwh_hook = SnowflakeHook(snowflake_conn_id="snowflake_conn")
    result = dwh_hook.get_first("select col1,col2,col3,col4,col5 from table_name where col_name = previous_date_func_here")
    # print(result)
I have not tested it yet, as I want to test it with GCS, but it seems like it's not going to work. What could be the ways?
Is it actually even possible to do this with Airflow?
Snowflake supports data unloading using the COPY INTO <location> command:
Unloads data from a table (or query) into one or more files in one of the following locations:
Named internal stage (or table/user stage). The files can then be downloaded from the stage/location using the GET command.
Named external stage that references an external location (Amazon S3, Google Cloud Storage, or Microsoft Azure).
External location (Amazon S3, Google Cloud Storage, or Microsoft Azure).
Format Type Options (formatTypeOptions)
TYPE = CSV
TYPE = JSON
TYPE = PARQUET
Unloading Data from a Table Directly to Files in an External Location
Google Cloud Storage
Access the referenced GCS bucket using a referenced storage integration named myint:
COPY INTO 'gcs://mybucket/unload/'
FROM mytable
STORAGE_INTEGRATION = myint
FILE_FORMAT = (FORMAT_NAME = my_csv_format);
Related: Configuring an Integration for Google Cloud Storage
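To tie this back to Airflow: the same COPY INTO statement can be issued from a task with the Snowflake provider's hook, so no data ever passes through the Composer workers. This is a sketch; the connection id, table, GCS path, and storage integration names are placeholders.

```python
# Sketch: run the Snowflake unload from an Airflow task.
# snowflake_conn, mytable, myint, and the GCS path are placeholders.


def build_unload_sql(gcs_path, table, integration):
    """COPY INTO statement that unloads a table to GCS as CSV."""
    return (
        f"COPY INTO '{gcs_path}' "
        f"FROM {table} "
        f"STORAGE_INTEGRATION = {integration} "
        "FILE_FORMAT = (TYPE = CSV) HEADER = TRUE"
    )


def unload_to_gcs(**context):
    # imported inside the task so the module parses without Airflow installed
    from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

    hook = SnowflakeHook(snowflake_conn_id="snowflake_conn")
    hook.run(build_unload_sql("gcs://mybucket/unload/", "mytable", "myint"))
```

unload_to_gcs can then be wrapped in a PythonOperator (or replaced with the provider's SnowflakeOperator running the same SQL); Snowflake writes the files to the bucket itself.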

bigquery storage API: Is it possible to stream / save AVRO files directly to Google Cloud Storage?

I would like to export a 90 TB BigQuery table to Google Cloud Storage. According to the documentation, BigQuery Storage API (beta) should be the way to go due to export size quotas (e.g., ExtractBytesPerDay) associated with other methods.
The table is date-partitioned, with each partition occupying ~300 GB. I have a Python AI Notebook running on GCP, which runs partitions (in parallel) through this script adapted from the docs.
from google.cloud import bigquery_storage_v1

client = bigquery_storage_v1.BigQueryReadClient()

table = "projects/{}/datasets/{}/tables/{}".format(
    "bigquery-public-data", "usa_names", "usa_1910_current"
)  # I am using my private table instead of this one.

requested_session = bigquery_storage_v1.types.ReadSession()
requested_session.table = table
requested_session.data_format = bigquery_storage_v1.enums.DataFormat.AVRO

parent = "projects/{}".format(project_id)
session = client.create_read_session(
    parent,
    requested_session,
    max_stream_count=1,
)
reader = client.read_rows(session.streams[0].name)

# The read stream contains blocks of Avro-encoded bytes. The rows() method
# uses the fastavro library to parse these blocks as an iterable of Python
# dictionaries.
rows = reader.rows(session)
Is it possible to save data from the stream directly to Google Cloud Storage?
I tried saving tables as AVRO files to my AI instance using fastavro and later uploading them to GCS using Blob.upload_from_filename(), but this process is very slow. I was hoping it would be possible to point the stream at my GCS bucket. I experimented with Blob.upload_from_file, but couldn't figure it out.
I cannot decode the whole stream to memory and use Blob.upload_from_string because I don't have over ~300 GB of RAM.
I spent the last two days parsing GCP documentation, but couldn't find anything, so I would appreciate your help, preferably with a code snippet, if at all possible. (If working with another file format is easier, I am all for it.)
Thank you!
Is it possible to save data from the stream directly to Google Cloud Storage?
By itself, the BigQuery Storage API is not capable of writing directly to GCS; you'll need to pair the API with code to parse the data, write it to local storage, and subsequently upload to GCS. This could be code that you write manually, or code from a framework of some kind.
It looks like the code snippet that you've shared processes each partition in a single-threaded fashion, which caps your throughput at the throughput of a single read stream. The storage API is designed to achieve high throughput through parallelism, so it's meant to be used with a parallel processing framework such as Google Cloud Dataflow or Apache Spark. If you'd like to use Dataflow, there's a Google-provided template you can start from; for Spark, you can use the code snippets that David has already shared.
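One way to avoid the local-disk round trip can be sketched under two assumptions: that the installed google-cloud-storage is recent enough to provide Blob.open for streaming writes, and that fastavro is available to re-serialize the rows. The bucket and object names are placeholders.

```python
# Sketch: write one read stream straight to GCS as an Avro file,
# assuming Blob.open (recent google-cloud-storage) and fastavro.
def stream_partition_to_gcs(reader, session, bucket_name, object_name):
    """Re-serialize a storage API read stream directly into a GCS object."""
    import json

    import fastavro
    from google.cloud import storage

    # the session carries the Avro schema of the rows as a JSON string
    schema = json.loads(session.avro_schema.schema)

    blob = storage.Client().bucket(bucket_name).blob(object_name)
    # Blob.open("wb") streams chunks to GCS, so the partition never has
    # to fit in RAM or land on local disk
    with blob.open("wb") as f:
        fastavro.writer(f, schema, reader.rows(session))
```

Even then, a single stream caps throughput as described above; running many such streams in parallel (or moving to Dataflow/Spark) is what actually makes a 90 TB export tractable.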
An easy way to do that would be to use Spark with the spark-bigquery-connector. It uses the BigQuery Storage API to read the table directly into a Spark DataFrame. You can create a Spark cluster on Dataproc, which is located in the same data centers as BigQuery and GCS, making the read and write speeds much faster.
A code example would look like this:
df = spark.read.format("bigquery") \
    .option("table", "bigquery-public-data.usa_names.usa_1910_current") \
    .load()
df.write.format("avro").save("gs://bucket/path")
You can also filter the data and work on each partition separately:
df = spark.read.format("bigquery") \
    .option("table", "bigquery-public-data.usa_names.usa_1910_current") \
    .option("filter", "the_date='2020-05-12'") \
    .load()

# OR, in case you don't need to give the partition at load time
df = spark.read.format("bigquery") \
    .option("table", "bigquery-public-data.usa_names.usa_1910_current") \
    .load()
df.where("the_date='2020-05-12'").write....
Please note that in order to read large amounts of data you would need a sufficiently large cluster.

Extracting excel files from the FTP to BigQuery using Cloud Functions

I am working on creating an automated script to download files from an FTP server and store them in BigQuery.
The problem is that BigQuery accepts only .csv files. For this reason, I am looking for ways to also handle .xlsx and .xls files, given that I am planning to put this batch code in the cloud.
I mention the latter because one way to convert .xlsx files to .csv is to use something like:
import pandas as pd
data_xls = pd.read_excel('file_on_ftp.xlsx')
data_xls.to_csv('csvfile.csv', encoding='utf-8', index=False)
However, this will create local files somewhere in temporary storage. Obviously, with Cloud Functions I then have to monitor whether the file has been deleted afterwards, which makes it unreliable if a Cloud Function were to crash.
Are there therefore better ways to handle loading .xlsx files into BigQuery? Or is this the way to go?
You might be interested in this guide, which was just recently published: "Streaming data from Cloud Storage into BigQuery using Cloud Functions".
The general architecture would be:
upload the files in question from FTP to Cloud Storage
your Cloud Function receives the upload event from Cloud Storage
your Cloud Function loads the file into memory (no storage on disk)
your Cloud Function streams the data to BigQuery
I'm not sure if #1 suits your need to ensure that the file is not left behind somewhere, but I think it's probably the best possible solution if you are operating under the assumption that the file needs to be uploaded to GCP somewhere (the alternative would be streaming to BigQuery directly from your local machine, or an instance that is under your control).
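Steps 2-4 of that architecture could be sketched roughly as below. The project, dataset, and table names are placeholders; it assumes pandas, openpyxl, and google-cloud-bigquery (plus pyarrow for the dataframe load) are listed in the function's requirements.txt. Note it uses a load job rather than the streaming inserts of the linked guide.

```python
# Sketch of the Cloud Function: convert the uploaded .xlsx in memory and
# load it into BigQuery without touching local disk. Bucket, dataset, and
# table names are placeholders.
import io


def table_id(project, dataset, table):
    return f"{project}.{dataset}.{table}"


def xlsx_to_bigquery(event, context):
    """Triggered by a finalize event on the GCS bucket."""
    from google.cloud import bigquery, storage
    import pandas as pd

    blob = storage.Client().bucket(event["bucket"]).blob(event["name"])
    data = blob.download_as_bytes()       # the file is held in memory only
    df = pd.read_excel(io.BytesIO(data))  # needs openpyxl for .xlsx

    client = bigquery.Client()
    job = client.load_table_from_dataframe(
        df, table_id("my-project", "my_dataset", "my_table")
    )
    job.result()  # wait for the load job to finish
```

Because nothing is written to /tmp, there is no leftover file to clean up if the function crashes mid-run.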
A good method is to extract the file from the FTP server to GCS through Cloud Functions. As @Dustin mentioned, you can then stream the data into BigQuery from GCS.
Here's a way to extract files from FTP via Cloud Functions:
# import libraries
from google.cloud import storage
import wget

def importFile(request):
    # set storage client
    client = storage.Client()

    # get bucket
    bucket = client.get_bucket('BUCKET-NAME')  # without gs://
    blob = bucket.blob('file-name.csv')

    # see if the file already exists
    if not blob.exists():
        # copy file to google storage
        try:
            link = 'ftp://account:password@ftp.domain.com/folder/file.csv'  # for non-public ftp files
            ftpfile = wget.download(link, out='/tmp/destination-file-name.csv')  # save downloaded file in /tmp folder of Cloud Functions
            blob.upload_from_filename(ftpfile)
            print('Copied file to Google Storage!')
        # print error if the download or upload fails
        except BaseException as error:
            print('An exception occurred: {}'.format(error))
    # print message if the file already exists in Google Storage
    else:
        print('File already exists in Google Storage')

Secrets in a google cloud bucket

We want to have a production Airflow environment but do not know how to deal properly with secrets, in particular Google BigQuery client JSON files.
We tried setting up Kubernetes secrets on the Kubernetes cluster that is created automatically when creating a Google Cloud Composer (Airflow) environment. We currently just put the files on the bucket, but would like a better way.
def get_bq_client():
    """returns bq client"""
    return bq.Client.from_service_account_json(
        join("volumes", "bigquery.json")
    )
We would like some form of proper management of the required secrets. Sadly, using Airflow Variables won't work, because we can't create the client object using the JSON file as text.
One solution that would work is to encrypt the JSON files and put them on the bucket. As long as the decryption key exists on the bucket and nowhere else, you can check the code, with its encrypted secrets, into source control, and then check it out in the bucket and decrypt.
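A rough sketch of that encrypt-then-decrypt flow, with one additional note: google-auth can build credentials from a parsed dict via from_service_account_info, so the decrypted JSON never has to touch disk. Fernet (from the cryptography package), the bucket and object names, and the SECRET_KEY environment variable are all assumptions here.

```python
# Sketch: decrypt a service-account key held in the bucket and build a
# BigQuery client from it in memory. Bucket/object names, the Fernet
# cipher, and the SECRET_KEY env var are assumptions.
import json
import os


def client_from_json_bytes(raw):
    """Build a BigQuery client from decrypted JSON bytes (no temp file)."""
    from google.cloud import bigquery
    from google.oauth2 import service_account

    info = json.loads(raw)
    creds = service_account.Credentials.from_service_account_info(info)
    return bigquery.Client(credentials=creds, project=info["project_id"])


def get_bq_client():
    from cryptography.fernet import Fernet
    from google.cloud import storage

    blob = storage.Client().bucket("my-bucket").blob("bigquery.json.enc")
    fernet = Fernet(os.environ["SECRET_KEY"].encode())
    return client_from_json_bytes(fernet.decrypt(blob.download_as_bytes()))
```

from_service_account_info also means an Airflow Variable holding the raw JSON text would work after all, if you prefer that over encryption in the bucket.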

ferris2-framework, python, google app engine, cloud storage -- uploading an image and making it public?

The Google Ferris2 framework seems to use the blobstore API exclusively for the Upload component, making me question whether it's possible to make images uploaded to Cloud Storage public without writing my own upload method and abandoning the Upload component altogether, which also seems to create compatibility issues when using the Cloud Storage client library (Python).
Backstory / context
Using: Google App Engine, Python, Cloud Storage client library
Requirements
We require that neither blob information nor the file be stored in the model. We want a public cloud-serving URL on the model, and that is all. This seems to prevent us from using the normal Ferris approach for uploading to Cloud Storage.
Things I already know / road blocks
One of the big roadblocks is dealing with Ferris using cgi / the blobstore API for field storage on the form. This seems to cause problems because, so far, it hasn't allowed data to be sent to Cloud Storage through the Google Cloud Storage Python client.
Things we know about the Google Cloud Storage Python client and cgi:
To write data to Cloud Storage from our server, Cloud Storage needs to be called with cloudstorage.open("/bucket/object", "w", ...) (a Cloud Storage library method). However, it appears so far that a cgi.FieldStorage is returned from the post for the wtforms.fields.FileField() (as shown by a simple "print image" statement) before the data is applied to the model; after it is applied to the model, it is a blobstore instance.
I would like verification on this:
After a lot of research and testing, it seems that because Ferris is limited to the blobstore API for the Uploads component, using the blobstore API and blob keys to handle uploads is basically unavoidable without creating a second upload function just for the Cloud Storage call. Blob instances seem incompatible with the Cloud Storage client library, and it seems there is no way to get anything but metadata from blob files (without actually making a call to Cloud Storage to fetch the original file). However, it appears that this will not require storing extra data on the server. Furthermore, I believe it may be possible to get around the public-link issue by setting the entire bucket to have read permissions.
Clarifying Questions:
1. To make uploaded images available to the public via our application (any user, not an authenticated user), will I have to use the cloudstorage Python client library, or is there a way to do this with the blobstore API?
2. Is there a way to get the original file from a blob key (on save, with the add action method) without first making a call to Cloud Storage, so that the file can be uploaded using that library?
3. If not, is there a way to grab the file from the cgi.FieldStorage and then send it to Cloud Storage with the Python client library? It seems that cgi.FieldStorage.value is just metadata and not the file; same with cgi.FieldStorage.file.read().
1) You cannot use the GAE GCS client to update an ACL.
2) You can use the GCS JSON API after the blobstore upload to GCS and change the ACL to make it public. You do not have to upload again.
See this example code, which inserts an ACL.
3) Or use cgi.FieldStorage to read the data (< 32 MB) and write it to GCS using the GAE GCS client:
import cloudstorage as gcs
import mimetypes
import webapp2

class UploadHandler(webapp2.RequestHandler):
    def post(self):
        file_data = self.request.get("file", default_value=None)
        filename = self.request.POST["file"].filename
        content_type = mimetypes.guess_type(filename)[0]
        with gcs.open(filename, 'w',
                      content_type=content_type or b'binary/octet-stream',
                      options={b'x-goog-acl': b'public-read'}) as f:
            f.write(file_data)
A third method: use a form post upload with a GCS signed url and a policy document to control the upload.
And you can always use a public download handler, which reads files from the blobstore or GCS.
You can now specify the ACL when uploading a file from App Engine to Cloud Storage. I'm not sure how long it's been in place; I just wanted to share:
filename = '/' + bucket_name + '/Leads_' + newUNID() + '.csv'
write_retry_params = gcs.RetryParams(backoff_factor=1.1)
gcs_file = gcs.open(filename,
                    'w',
                    content_type='text/csv',
                    options={'x-goog-acl': 'public-read'},
                    retry_params=write_retry_params)
docs: https://cloud.google.com/storage/docs/xml-api/reference-headers#standard
