I need to write a result set from MySQL in CSV format to a bucket in Google Cloud Storage.
Following the instructions here, I created the following example code:
import cloudstorage
from google.appengine.api import app_identity
import db  # My own MySQL wrapper

dump = db.get_table_dump(schema)  # Here I run a simple SQL SELECT and fetchall()
bucket_name = app_identity.get_default_gcs_bucket_name()
file_name = "/" + bucket_name + "/finalfiles/" + schema + "/" + "myfile.csv"

with cloudstorage.open(file_name, "w") as gcsFile:
    gcsFile.write(dump)
It did not work because write() expects a string parameter and dump is the tuple of tuples returned by fetchall().
I can't use this approach (or a similar one) since I can't write files in the GAE environment, and I also can't build a CSV string from the tuple like here, due to the size of my result set (I actually tried it; it took too long and timed out before finishing).
So, my question is: what is the best way to get a result set from MySQL and save it as CSV in a Google Cloud Storage bucket?
I just went through the same problem with PHP. I ended up using the Cloud SQL Admin API (https://cloud.google.com/sql/docs/mysql/admin-api/v1beta4/) with the following workflow:
Create an export bucket (e.g. test-exports)
Give the SQL Instance Read/Write permissions to the bucket created in step 1
Within the application, make a call to instances/export (https://cloud.google.com/sql/docs/mysql/admin-api/v1beta4/instances/export). This endpoint accepts the SQL to run, as well as a path into the output bucket created in step 1.
Step 3 will return an operation with a 'name' property. You can use this 'name' to poll the operations/get endpoint (https://cloud.google.com/sql/docs/mysql/admin-api/v1beta4/operations/get) until the status is returned as DONE.
We have a job which performs these steps nightly (as well as an import using the /import command) on 6 tables and have yet to see any issues. The only thing to keep in mind is that only one operation can run on a single database instance at a time. To work around this, you should check the top item from the operations/list endpoint (https://cloud.google.com/sql/docs/mysql/admin-api/v1beta4/operations/list) to confirm the database is ready before issuing any commands.
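For reference, here is a minimal Python sketch of steps 3 and 4 using the google-api-python-client library. The project, instance, bucket, table, and query names are placeholders, and I'm assuming application-default credentials with access to the sqladmin scope:

import time
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
sqladmin = discovery.build('sqladmin', 'v1beta4', credentials=credentials)

# Step 3: ask the instance to export the query result straight into the bucket
body = {
    'exportContext': {
        'fileType': 'CSV',
        'uri': 'gs://test-exports/finalfiles/myfile.csv',
        'csvExportOptions': {'selectQuery': 'SELECT * FROM mytable'},
    }
}
operation = sqladmin.instances().export(
    project='my-project', instance='my-instance', body=body).execute()

# Step 4: poll operations/get until the export reports DONE
while True:
    op = sqladmin.operations().get(
        project='my-project', operation=operation['name']).execute()
    if op['status'] == 'DONE':
        break
    time.sleep(5)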
I'm trying to do the following:
When I upload a file to my S3 bucket, a Lambda should pick up this JSON file and convert it into a CSV file.
How can I specify in the Lambda code which file it must pick up?
Example of my code running locally:
import pandas as pd
df = pd.read_json('movies.json')
df.to_csv('csv-movies.csv')
In this example I provide the name of the file, but how can I manage that in a Lambda?
I think I don't understand how Lambda works. Could you give me an example?
Lambda spins up execution environments to handle your requests. When it initialises these environments, it'll pull the code you uploaded, and execute it when invoked.
Execution environments have a concept of ephemeral (temporary) storage with a default size of 512 MB.
Lambda doesn't have access to your files in S3 by default. You'd first need to download your file from S3 using something like the AWS SDK for Python. You can store it in the /tmp directory to make use of the ephemeral storage I mentioned earlier.
Once you've downloaded the file using the SDK, you can interact with it as you would if you were running this locally, like in your example.
On the flip side, you'd also need to use the SDK to upload the CSV back to S3 if you want to keep it beyond the lifecycle of that execution environment.
Something else you might want to explore in future is reading that file into memory and doing away with storing it in ephemeral storage altogether.
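A minimal sketch of that flow, assuming pandas is packaged with your function and using placeholder bucket and key names:

import boto3
import pandas as pd

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Download the uploaded JSON into the ephemeral /tmp storage
    s3.download_file('my-bucket', 'movies.json', '/tmp/movies.json')

    # Convert it exactly as you would locally
    df = pd.read_json('/tmp/movies.json')
    df.to_csv('/tmp/csv-movies.csv', index=False)

    # Upload the CSV back to S3 so it outlives this execution environment
    s3.upload_file('/tmp/csv-movies.csv', 'my-bucket', 'csv-movies.csv')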
To achieve this you will need to use S3 as the event source for your Lambda. There's a useful tutorial for this provided by AWS themselves, with some sample Python code to assist you; you can view it here.
To break it down slightly further and answer how you get the name of the file: the Lambda handler will look similar to the following:
def lambda_handler(event, context):
What is important here is the event object. When your event source is the S3 bucket, you will be given the name of the bucket and the S3 key of the object, which is effectively the path to the file in the bucket. With this information you can decide whether you want to download the file from that path. If you do, you can use the S3 get_object() API call as shown in the tutorial.
Once this file is downloaded it can be used like any other file on your local machine, so you can then proceed to convert the JSON to a CSV. Once it is converted you will presumably want to put it back in S3; for this you can use the S3 put_object() call and reuse the information in the event object to specify the path.
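Putting that together, here is a rough sketch of a handler that pulls the bucket and key out of the event. The names and the conversion helper are placeholders (convert_json_to_csv is hypothetical), and note that keys arrive URL-encoded:

import urllib.parse
import boto3

s3 = boto3.client('s3')

def lambda_handler(event, context):
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(record['s3']['object']['key'])

    # Fetch the uploaded JSON file
    obj = s3.get_object(Bucket=bucket, Key=key)
    data = obj['Body'].read()

    csv_bytes = convert_json_to_csv(data)  # hypothetical conversion helper

    # Put the CSV back into the same bucket, reusing the event information
    s3.put_object(Bucket=bucket, Key=key.rsplit('.', 1)[0] + '.csv', Body=csv_bytes)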
I am loading a csv file into an Azure Blob Storage account. I would like a process to be triggered when a new file is added, that takes the new CSV and BCP loads it into an Azure SQL database.
My idea is to have an Azure Data Factory pipeline that is event triggered. However, I am stuck as to what to do next. Should an Azure Function be triggered that takes this CSV and uses BCP to load it into the DB? Can Azure Functions even use BCP?
I am using Python.
Please check the link below. Basically, you want to copy new files as well as modified files, and for that a single Copy Data activity is useful. Use an event-based trigger (fired when files are created) instead of a scheduled one.
https://www.mssqltips.com/sqlservertip/6365/incremental-file-load-using-azure-data-factory/
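The question also asks whether an Azure Function could do the load. As a hedged alternative to shelling out to bcp, here is a rough sketch that bulk-inserts the CSV rows with pyodbc; the driver name, server, table, columns, and file path are all placeholders/assumptions:

import csv
import pyodbc

# Assumed connection details; adjust driver/server/credentials for your setup
conn = pyodbc.connect(
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=myserver.database.windows.net;DATABASE=mydb;UID=myuser;PWD=mypassword')
cursor = conn.cursor()
cursor.fast_executemany = True  # batches the parameterised inserts

with open('/tmp/newfile.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    rows = list(reader)

cursor.executemany('INSERT INTO dbo.MyTable (col1, col2, col3) VALUES (?, ?, ?)', rows)
conn.commit()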
New JSON files are dumped into an S3 bucket daily. I have to create a solution which picks up the latest file when it arrives, parses the JSON, and loads it into the Snowflake data warehouse. Could someone please share their thoughts on how we can achieve this?
There are a number of ways to do this depending on your needs. I would suggest creating an event to trigger a lambda function.
https://docs.aws.amazon.com/lambda/latest/dg/with-s3.html
Another option may be to create an SQS message when the file lands on S3 and have an EC2 instance poll the queue and process it as necessary.
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/sqs-example-long-polling.html
Edit: Here is a more detailed explanation of how to create events from S3 and trigger Lambda functions; the documentation is provided by Snowflake:
https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe-rest-lambda.html
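For the SQS option mentioned above, a rough sketch of the polling loop an EC2 worker could run (the queue URL is a placeholder, and the Snowflake load itself is left out):

import json
import boto3

sqs = boto3.client('sqs')
queue_url = 'https://sqs.us-east-1.amazonaws.com/123456789012/s3-new-files'  # placeholder

while True:
    resp = sqs.receive_message(QueueUrl=queue_url, MaxNumberOfMessages=1,
                               WaitTimeSeconds=20)  # long polling
    for msg in resp.get('Messages', []):
        body = json.loads(msg['Body'])
        for record in body.get('Records', []):
            bucket = record['s3']['bucket']['name']
            key = record['s3']['object']['key']
            # parse the JSON file here and load it into Snowflake
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg['ReceiptHandle'])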
Look into Snowpipe; it lets you do that within the system, making it (possibly) much easier.
There are some aspects to be considered, such as whether it is batch or streaming data, whether you want to retry loading the file in case the data or format is wrong, and whether you want to make it a generic process able to handle different file formats/types (CSV/JSON) and stages.
In our case we have built a generic S3-to-Snowflake load using Python and Luigi, and also implemented the same using SSIS, but for CSV/TXT files only.
In my case, I have a Python script which gets information about the bucket with boto.
Once I detect a change, I call the insertFiles REST endpoint on Snowpipe.
Phasing:
detect S3 change
get S3 object path
parse the content and transform it to CSV in S3 (the same bucket, or another one the Snowpipe stage can connect to)
call the Snowpipe REST API (see the sketch after the list below)
What you need:
Create a user with a public key
Create your stage on Snowflake with AWS credentials in order to access S3
Create your pipe on Snowflake with your user role
Sign a JWT
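A rough sketch of the REST call step using the snowflake-ingest Python package, which signs the JWT for you; the account, user, pipe, key path, and staged file path below are placeholders:

from snowflake.ingest import SimpleIngestManager, StagedFile

# Private key matching the public key registered on the Snowflake user
with open('/path/to/rsa_key.p8') as f:
    private_key = f.read()

ingest_manager = SimpleIngestManager(
    account='myaccount',
    host='myaccount.snowflakecomputing.com',
    user='SNOWPIPE_USER',
    pipe='MYDB.MYSCHEMA.MYPIPE',
    private_key=private_key)

# Tell the pipe which staged file(s) to load (this calls insertFiles under the hood)
resp = ingest_manager.ingest_files([StagedFile('path/in/stage/myfile.csv', None)])
print(resp)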
I also tried this with a Talend job using TOS BigData.
Hope it helps.
I want to copy keys between buckets across 2 different accounts using the boto3 API.
In boto3, I executed the following code and the copy worked
import boto3

source = boto3.client('s3')       # client for the source account
destination = boto3.client('s3')  # client for the destination account
obj = source.get_object(Bucket='bucket', Key='key')
destination.put_object(Bucket='bucket', Key='key', Body=obj['Body'].read())
Basically I am fetching the data with GET and writing it with PUT into the other account.
Along similar lines with the boto API, I have done the following:
from boto.s3.connection import S3Connection
from boto.s3.key import Key

source = S3Connection()
source_bucket = source.get_bucket('bucket')
source_key = Key(source_bucket, key_name)

destination = S3Connection()
destination_bucket = destination.get_bucket('bucket')
dist_key = Key(destination_bucket, source_key.key)
dist_key.set_contents_from_string(source_key.get_contents_as_string())
The above code achieves the purpose of copying any type of data.
But the speed is very slow: it takes around 15-20 seconds to copy 1 GB of data, and I have to copy 100+ GB.
I tried Python multithreading, where each thread performs a copy operation. The performance was bad, as it took 30 seconds to copy 1 GB. I suspect the GIL might be the issue here.
I tried multiprocessing and I get the same result as a single process, i.e. 15-20 seconds for a 1 GB file.
I am using a very high-end server with 48 cores and 128 GB RAM. The network speed in my environment is 10 Gbps.
Most of the search results cover copying data between buckets in the same account, not across accounts. Can anyone please guide me here? Is my approach wrong? Does anyone have a better solution?
Yes, it is the wrong approach.
You shouldn't download the file. You are using AWS infrastructure, so you should make use of the efficient AWS backend call to do the work. Your approach is wasting resources.
boto3.client.copy will do the job better than this.
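A minimal sketch of that server-side copy (bucket names are placeholders; the client's credentials need read access on the source and write access on the destination):

import boto3

s3 = boto3.client('s3')

copy_source = {'Bucket': 'source-bucket', 'Key': 'path/to/key'}

# S3 copies the object in its backend; the bytes never pass through your server,
# and multipart copy for large objects is handled automatically.
s3.copy(copy_source, 'destination-bucket', 'path/to/key')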
In addition, you didn't describe what you are trying to achieve (e.g. is this some sort of replication requirement?).
With a proper understanding of your own needs, it is possible that you don't even need a server to do the job: S3 bucket event triggers, Lambda, etc. can all execute the copying job without a server.
To copy files between two different AWS accounts, you can check out this link: Copy S3 object between AWS account
Note:
S3 is a huge virtual object store for everyone; that's why bucket names MUST be unique. This also means the S3 "controller" can do a lot of fancy work similar to a file server, e.g. replication, copying, and moving files in the backend, without involving network traffic.
As long as you set up the proper IAM permissions/policies for the destination bucket, objects can move across buckets without an additional server.
This is similar to a file server. Users can copy files to each other without "download/upload"; instead, one just creates a folder with write permission for all, and copying a file from another user is done entirely within the file server, at raw disk I/O speed. You don't need a powerful instance or a high-performance network when using the backend S3 copy API.
Your method is like attempting an FTP download of a file from a user on the same file server, which creates unwanted network traffic.
You should check out the TransferManager in boto3. It will automatically handle the threading of multipart uploads in an efficient way. See the docs for more detail.
Basically, you just have to use the upload_file method and the TransferManager will take care of the rest.
import boto3
# Get the service client
s3 = boto3.client('s3')
# Upload tmp.txt to bucket-name at key-name
s3.upload_file("tmp.txt", "bucket-name", "key-name")
I want to connect to the db available inside DynamoDB Local using the boto SDK. I followed the documentation at the link below.
http://boto.readthedocs.org/en/latest/dynamodb2_tut.html#dynamodb-local
This is the official documentation provided by Amazon. But when I execute the snippet from the document, I am unable to connect to the db and I can't get the tables available inside it. The db name is "dummy_us-east-1.db". My snippet is:
from boto.dynamodb2.layer1 import DynamoDBConnection

con = DynamoDBConnection(
    host='localhost',
    port=8000,
    aws_access_key_id='dummy',
    aws_secret_access_key='dummy',
    is_secure=False,
)
print con.list_tables()
I have 8 tables inside the db, but I am getting an empty list after executing the list_tables() command.
output:
{u'TableNames':[]}
Instead of accessing the required database, it is creating and accessing a new database.
Old database : dummy_us-east-1.db
New database : dummy_localhost.db
How can I resolve this?
Please give me some suggestions regarding DynamoDB Local access. Thanks in advance.
It sounds like you are connecting to DynamoDB Local in different ways (with different credential/region combinations), and by default it creates a separate database file for each.
If so, you can also start DynamoDB Local with the -sharedDb flag to force it to use a single db file:
-sharedDb When specified, DynamoDB Local will use a
single database instead of separate databases
for each credential and region. As a result,
all clients will interact with the same set of
tables, regardless of their region and
credential configuration.
E.g.
java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb
Here is the solution. This happens because you didn't start DynamoDB Local with the -sharedDb flag; start it like this (from the directory containing the jar):
java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb