Loading a new CSV in Azure Blob Storage to SQL DB - python

I am loading a CSV file into an Azure Blob Storage account. I would like a process to be triggered when a new file is added that takes the new CSV and BCP-loads it into an Azure SQL database.
My idea is to have an Azure Data Factory pipeline that is event-triggered. However, I am stuck as to what to do next. Should an Azure Function be triggered that takes this CSV and uses BCP to load it into the DB? Can Azure Functions even use BCP?
I am using Python.

Please check the link below. Since you want to copy new files as well as modified files, a single Copy Data activity is useful here. Use an event-based trigger (fired when files are created) instead of a scheduled one.
https://www.mssqltips.com/sqlservertip/6365/incremental-file-load-using-azure-data-factory/
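If you do go down the Azure Functions route mentioned in the question, note that BCP itself is not available inside the Functions sandbox, but a blob-triggered Python function can parse the CSV and bulk-insert it with pyodbc, which covers the same use case for moderate file sizes. A minimal sketch, assuming a blob trigger binding named myblob in function.json, the msodbcsql driver on the function host, and a connection string stored in an app setting called SQL_CONN_STR (the table and columns are placeholders too):

import csv
import io
import logging
import os

import azure.functions as func   # Azure Functions Python worker
import pyodbc                     # needs the msodbcsql driver on the host


def main(myblob: func.InputStream):
    """Runs for every new blob; loads the CSV rows into Azure SQL."""
    logging.info("Processing new blob: %s", myblob.name)

    # Read the whole CSV from the blob stream (fine for small/medium files).
    reader = csv.reader(io.StringIO(myblob.read().decode("utf-8")))
    next(reader)                              # skip the header row
    rows = [tuple(r) for r in reader]

    conn = pyodbc.connect(os.environ["SQL_CONN_STR"])   # placeholder app setting
    try:
        cursor = conn.cursor()
        cursor.fast_executemany = True        # batched inserts, closest to BCP speed
        # dbo.StagingTable and its three columns are placeholders for your schema.
        cursor.executemany(
            "INSERT INTO dbo.StagingTable (col1, col2, col3) VALUES (?, ?, ?)",
            rows,
        )
        conn.commit()
    finally:
        conn.close()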

Related

load exported firestore db into python

I want to set up a backup system for my Firestore databases where every few weeks the data gets saved to a GCP bucket and all the collections are deleted.
In the documentation I found a gcloud command that lets me export the data from Firestore.
gcloud firestore export gs://[BUCKET_NAME]
I know I can import it back into firestore or send the data to bigquery but is there a way that I can just load the exported data into python as a csv, json or some other format without needing to use another tool?
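No answer was posted for this one, but for context: the export above is written in Google's backup format (LevelDB-based), not CSV or JSON, so it cannot simply be opened in Python; it is meant to be re-imported into Firestore or loaded into BigQuery, as the question notes. What you can do from Python without extra tooling is trigger the export and pull the resulting files down from the bucket, as in this hedged sketch (the bucket name and prefix are placeholders):

import subprocess

from google.cloud import storage   # pip install google-cloud-storage

BUCKET = "my-firestore-backups"    # placeholder bucket name
PREFIX = "exports/latest"          # placeholder export prefix

# Run the same export command shown above (needs gcloud on PATH and auth set up).
subprocess.run(
    ["gcloud", "firestore", "export", f"gs://{BUCKET}/{PREFIX}"],
    check=True,
)

# Download the export files locally. They are backup files in Google's internal
# format, not CSV/JSON, so to get tabular data you would still re-import them
# into Firestore or load them into BigQuery rather than parsing them directly.
client = storage.Client()
for blob in client.list_blobs(BUCKET, prefix=PREFIX):
    local_name = blob.name.replace("/", "_")
    blob.download_to_filename(local_name)
    print("downloaded", local_name)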

Is it possible to upload a CSV to redshift and have it automatically run and export the saved queries?

I manually uploaded a CSV to S3, then copied it into Redshift and ran the queries. I want to build a website where users can enter data, have the queries run automatically when the data is entered, and show the results of the queries.
Amazon Redshift does not have Triggers. Therefore, it is not possible to 'trigger' an action when data is loaded into Redshift.
Instead, whatever process you use to load the data will also need to run the queries.
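In practice that means the same script that loads the CSV also runs the saved queries immediately afterwards. A minimal sketch using psycopg2 against the cluster (the endpoint, credentials, IAM role, table and queries are all placeholders):

import psycopg2   # the standard PostgreSQL driver also works against Redshift

# All connection details below are placeholders.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="dev",
    user="awsuser",
    password="my-password",
)
conn.autocommit = True
cur = conn.cursor()

# 1. Load the uploaded CSV from S3 into a staging table.
cur.execute("""
    COPY staging.uploads
    FROM 's3://my-bucket/uploads/data.csv'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV
    IGNOREHEADER 1;
""")

# 2. Immediately run the saved queries against the fresh data.
cur.execute("SELECT region, SUM(amount) FROM staging.uploads GROUP BY region;")
for row in cur.fetchall():
    print(row)

cur.close()
conn.close()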

ETL to BigQuery using Airflow without permission to Cloud Storage / Cloud SQL

I have done ETL from MySQL to BigQuery with Python, but because I don't have permission to connect to Google Cloud Storage / Cloud SQL, I have to dump the data and partition it by last date. This way is easy but not worth it because it takes a lot of time. Is it possible to do ETL with Airflow from MySQL/Mongo to BigQuery without Google Cloud Storage / Cloud SQL?
With Airflow or not, the easiest and most efficient way is to:
Extract the data from the data source
Load the data into a file
Drop the file into Cloud Storage
Run a BigQuery load job on these files (load jobs are free); see the sketch below
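As a rough illustration of the last step, a load job on a CSV file that is already in Cloud Storage can be run with the google-cloud-bigquery client; the bucket, dataset and table names below are placeholders:

from google.cloud import bigquery   # pip install google-cloud-bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,      # skip the header row
    autodetect=True,          # let BigQuery infer the schema
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# Placeholders: replace with your bucket/object and project.dataset.table.
load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/data.csv",
    "my_project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()   # wait for the load job to finish (the load itself is free)
print(client.get_table("my_project.my_dataset.my_table").num_rows, "rows loaded")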
If you want to avoid creating a file and dropping it into Cloud Storage, another, much more complex way is possible: stream the data into BigQuery.
Run a query (MySQL or Mongo)
Fetch the results
For each row, stream-write the result into BigQuery (streaming inserts are not free on BigQuery); see the sketch after this answer
Described like this, it does not seem very complex, but:
You have to maintain the connections to the source and to the destination for the whole process
You have to handle errors (read and write) and be able to restart at the last point of failure
You have to batch the stream writes into BigQuery to optimize performance, and the chunk size has to be chosen wisely
Airflow bonus: you have to define and write your own custom operator to do this
In any case, I strongly recommend the first solution.
Additional tip: BigQuery can now query Cloud SQL databases directly (federated queries). If you still need your MySQL database (to keep some reference data in it), you can migrate it to Cloud SQL and join your BigQuery data warehouse with your Cloud SQL reference tables.
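For completeness, a minimal sketch of the streaming alternative described above, pulling rows from MySQL and stream-writing them in batches; the pymysql driver, connection details, table names and batch size are all assumptions:

import pymysql                     # placeholder MySQL driver
from google.cloud import bigquery

BATCH_SIZE = 500                   # chunk size has to be tuned for your workload
TABLE_ID = "my_project.my_dataset.my_table"   # placeholder

bq = bigquery.Client()
db = pymysql.connect(host="mysql-host", user="user", password="secret", database="shop")

def flush(batch):
    """Stream one batch into BigQuery (streaming inserts are billed)."""
    errors = bq.insert_rows_json(TABLE_ID, batch)
    if errors:
        raise RuntimeError(errors)   # error handling / restart logic goes here

with db.cursor(pymysql.cursors.DictCursor) as cur:
    cur.execute("SELECT id, name, updated_at FROM customers")   # placeholder query
    batch = []
    for row in cur:
        # Convert values that are not JSON-serializable (e.g. datetimes).
        row["updated_at"] = row["updated_at"].isoformat()
        batch.append(row)
        if len(batch) >= BATCH_SIZE:
            flush(batch)
            batch = []
    if batch:                        # flush the last partial chunk
        flush(batch)

db.close()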
It is indeed possible to synchronize MySQL databases to BigQuery with Airflow.
You would of course need to make sure you have properly authenticated connections configured for your Airflow DAG workflow.
Also, define which columns from MySQL you want to pull and load into BigQuery, and choose how you want to load your data: incrementally or fully. Be sure to also plan a technique for eliminating duplicate copies of the data (de-duplication).
You can find more information on this topic through this link:
How to Sync Mysql into Bigquery in realtime?
Here is a great resource for setting up your BigQuery account and authentication:
https://www.youtube.com/watch?v=fAwWSxJpFQ8
You can also have a look at Stitch (https://www.stitchdata.com/integrations/mysql/google-bigquery/).
The Stitch MySQL integration will ETL your MySQL data to Google BigQuery in minutes and keep it up to date without your having to constantly write and maintain ETL scripts. Google Cloud Storage and Cloud SQL are not necessary in this case.
For more information on aggregating data for BigQuery using Apache Airflow you may refer to the link below:
https://cloud.google.com/blog/products/gcp/how-to-aggregate-data-for-bigquery-using-apache-airflow
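As a rough illustration only (it stages through a GCS bucket, as recommended in the previous answer, and operator names and arguments vary between versions of the apache-airflow-providers-google package), a two-step Airflow DAG for such a sync might look like this:

from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
from airflow.providers.google.cloud.transfers.mysql_to_gcs import MySQLToGCSOperator

# Connection IDs, bucket, dataset and table names below are all placeholders.
with DAG(
    dag_id="mysql_to_bigquery_sync",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    extract = MySQLToGCSOperator(
        task_id="mysql_to_gcs",
        mysql_conn_id="my_mysql",
        sql="SELECT * FROM customers",
        bucket="my-staging-bucket",
        filename="exports/customers_{{ ds }}.json",   # written as newline-delimited JSON
        export_format="json",
    )

    load = GCSToBigQueryOperator(
        task_id="gcs_to_bq",
        bucket="my-staging-bucket",
        source_objects=["exports/customers_{{ ds }}.json"],
        destination_project_dataset_table="my_project.my_dataset.customers",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_TRUNCATE",
        autodetect=True,
    )

    extract >> load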

How to read and modify a csv file on one bucket in cloud storage and save the results in another bucket using Cloud Functions

I have CSV files arriving in a folder on a Cloud Storage bucket, and I want to create a Cloud Function that opens the CSV, adds a new column to it and then saves the result to another bucket as a new.csv file.
Is there a way to do that using a Python Cloud Function?
Thanks in advance.
The idea that you’re trying to implement is totally possible and can be achieved using Google Cloud Functions.
For that, you would need to create a storage-triggered Cloud Function. More specifically, you can create your function so that it responds to change notifications from your Google Cloud Storage bucket.
These notifications can be configured to respond to various events inside a bucket: object creation, deletion, archiving and metadata updates.
For the situation described, you will need to use the trigger google.storage.object.finalize.
This event is sent when a new object is created in the bucket or an existing object is overwritten, and a new generation of that object is created.
Here you can find a sample code of a storage-triggered Cloud Function written in Python, while this tutorial will give you a more detailed overview of the usage of storage-triggered functions.
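A minimal sketch of such a function, assuming it is deployed with a google.storage.object.finalize trigger on the source bucket and that pandas is listed in requirements.txt; the destination bucket name and the added column are placeholders:

import io

import pandas as pd                # assumed to be in requirements.txt
from google.cloud import storage

DEST_BUCKET = "my-output-bucket"   # placeholder destination bucket


def process_csv(event, context):
    """Background Cloud Function for the google.storage.object.finalize event."""
    src_bucket = event["bucket"]   # bucket that received the new object
    src_object = event["name"]     # path of the uploaded object
    if not src_object.endswith(".csv"):
        return                     # ignore anything that is not a CSV

    client = storage.Client()

    # Download the new CSV and load it into a DataFrame.
    data = client.bucket(src_bucket).blob(src_object).download_as_bytes()
    df = pd.read_csv(io.BytesIO(data))

    # Add the new column (placeholder logic for whatever the real column is).
    df["processed"] = True

    # Save the result as new.csv in the destination bucket.
    out_blob = client.bucket(DEST_BUCKET).blob("new.csv")
    out_blob.upload_from_string(df.to_csv(index=False), content_type="text/csv")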

Automate file loading from S3 to Snowflake

New JSON files are dumped into an S3 bucket daily. I have to create a solution that picks up the latest file when it arrives, parses the JSON and loads it into the Snowflake data warehouse. Can someone please share your thoughts on how we can achieve this?
There are a number of ways to do this depending on your needs. I would suggest creating an S3 event notification that triggers a Lambda function.
https://docs.aws.amazon.com/lambda/latest/dg/with-s3.html
Another option is to publish an SQS message when the file lands in S3 and have an EC2 instance poll the queue and process the file as necessary.
https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
https://boto3.amazonaws.com/v1/documentation/api/latest/guide/sqs-example-long-polling.html
Edit: here is a more detailed explanation of how to create S3 events and trigger Lambda functions; the documentation is provided by Snowflake:
https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe-rest-lambda.html
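As a rough sketch of the Lambda route, the handler can read the key of the new file from the S3 event and run a COPY INTO through snowflake-connector-python (packaged with the function); the stage, table, warehouse and credential environment variables are all placeholders, and Snowpipe (mentioned below) avoids much of this plumbing:

import os

import snowflake.connector   # snowflake-connector-python, packaged with the Lambda


def lambda_handler(event, context):
    """Triggered by the S3 ObjectCreated event; loads the new JSON file."""
    # The S3 event carries the key of the file that just landed.
    key = event["Records"][0]["s3"]["object"]["key"]

    conn = snowflake.connector.connect(
        account=os.environ["SF_ACCOUNT"],    # placeholder environment variables
        user=os.environ["SF_USER"],
        password=os.environ["SF_PASSWORD"],
        warehouse="LOAD_WH",
        database="RAW",
        schema="PUBLIC",
    )
    try:
        cur = conn.cursor()
        # MY_S3_STAGE is an external stage created over the bucket beforehand.
        cur.execute(
            f"COPY INTO raw_json_table FROM @MY_S3_STAGE/{key} "
            "FILE_FORMAT = (TYPE = 'JSON')"
        )
    finally:
        conn.close()
    return {"loaded": key}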
Look into Snowpipe; it lets you do this within the system, making it (possibly) much easier.
There are some aspects to consider, such as: is it batch or streaming data, do you want to retry loading the file if the data or format is wrong, and do you want to make it a generic process able to handle different file formats/types (CSV/JSON) and stages.
In our case we built a generic S3-to-Snowflake load using Python and Luigi, and also implemented the same using SSIS, but for CSV/TXT files only.
In my case, I have a Python script which gets information about the bucket with boto.
Once I detect a change, I call the Snowpipe REST endpoint insertFiles.
Phasing:
Detect the S3 change
Get the S3 object path
Parse the content and transform it to CSV in S3 (same bucket, or another one Snowpipe can connect to)
Call the Snowpipe REST API
What you need:
Create a user with a public key
Create your stage on Snowflake with AWS credentials in order to access S3
Create your pipe on Snowflake with your user role
Sign a JWT
I also tried this with a Talend job (TOS Big Data).
Hope it helps.
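For the "call the Snowpipe REST API" step, the snowflake-ingest Python package wraps the insertFiles endpoint and the JWT signing; a hedged sketch, where the account, user, pipe name and key path are placeholders and the exact constructor arguments may differ between package versions:

from snowflake.ingest import SimpleIngestManager, StagedFile

# The private key must match the public key registered on the Snowflake user.
with open("rsa_key.pem") as f:           # placeholder key path
    private_key = f.read()

ingest_manager = SimpleIngestManager(
    account="xy12345",                    # placeholder account locator
    host="xy12345.snowflakecomputing.com",
    user="INGEST_USER",
    pipe="RAW.PUBLIC.MY_PIPE",            # fully qualified pipe name
    private_key=private_key,
)

# Tell Snowpipe which staged file(s) to load (paths relative to the pipe's stage).
response = ingest_manager.ingest_files([StagedFile("2024/01/data.json", None)])
print(response)   # inspect the response to confirm the request was accepted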
