I want to set up a backup system for my Firestore databases where, every few weeks, the data gets saved to a GCP bucket and all the collections are deleted.
In the documentation I found a gcloud command that lets me export the data from Firestore to a bucket:
gcloud firestore export gs://[BUCKET_NAME]
I know I can import it back into Firestore or send the data to BigQuery, but is there a way I can just load the exported data into Python as CSV, JSON, or some other format without needing to use another tool?
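For reference, the export that the gcloud command above performs can also be triggered from Python via the Firestore Admin client. A minimal sketch, with placeholder project and bucket names (the exact client surface may differ between library versions):

from google.cloud import firestore_admin_v1

# Placeholder project and bucket names.
client = firestore_admin_v1.FirestoreAdminClient()
operation = client.export_documents(
    request={
        "name": "projects/my-project/databases/(default)",
        "output_uri_prefix": "gs://my-backup-bucket/firestore-exports",
    }
)
operation.result()  # wait for the long-running export operation to finish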
Related
Is there any way to unload data from Snowflake to CSV format, or can it be stored directly as CSV in Google Cloud Storage?
We are using Cloud Composer (Airflow) DAGs to connect to Snowflake, unload data from tables into CSV files, and store them in Google Cloud Storage, to be migrated further later on.
What I have tried:
Querying data from a Snowflake table and storing the result in a variable.
What I want to do next:
Convert the data into a CSV file (I have not run the code yet) and move it to a GCS bucket, but it seems like Airflow only has GCSToGCSOperator, which does not help here.
What I am considering:
Whether I should use a Python file with a scheduler instead of writing a DAG.
Doing it through Dataflow (Beam) and running it on Composer.
Code:
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

def func(**context):
    dwh_hook = SnowflakeHook(snowflake_conn_id="snowflake_conn")
    # Fetch the first matching row; the date predicate is still a placeholder.
    result = dwh_hook.get_first("select col1,col2,col3,col4,col5 from table_name where col_name = previous_date_func_here")
    # print(result)
I have not tested it yet because I want to test it with GCS, but it seems like it is not going to work. What are the possible ways to do this?
Is it actually possible to do this with Airflow at all?
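For what it's worth, here is a minimal, untested sketch of the flow described above (query Snowflake, write a CSV, upload it to GCS), assuming an Airflow GCS connection and placeholder bucket/object names:

from airflow.providers.google.cloud.hooks.gcs import GCSHook
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook

def unload_to_gcs(**context):
    dwh_hook = SnowflakeHook(snowflake_conn_id="snowflake_conn")
    # Fetch the whole result set as a DataFrame, then dump it to a local CSV.
    df = dwh_hook.get_pandas_df(
        "select col1, col2, col3, col4, col5 from table_name")
    df.to_csv("/tmp/table_name.csv", index=False)
    # Push the CSV to the target bucket (placeholder names).
    GCSHook(gcp_conn_id="google_cloud_default").upload(
        bucket_name="my-bucket",
        object_name="snowflake/table_name.csv",
        filename="/tmp/table_name.csv",
    )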
Snowflake supports data unloading using the COPY INTO <location> command:
Unloads data from a table (or query) into one or more files in one of the following locations:
Named internal stage (or table/user stage). The files can then be downloaded from the stage/location using the GET command.
Named external stage that references an external location (Amazon S3, Google Cloud Storage, or Microsoft Azure).
External location (Amazon S3, Google Cloud Storage, or Microsoft Azure).
Format Type Options (formatTypeOptions)
TYPE = CSV
TYPE = JSON
TYPE = PARQUET
Unloading Data from a Table Directly to Files in an External Location
Google Cloud Storage
Access the referenced GCS bucket using a referenced storage integration named myint:
COPY INTO 'gcs://mybucket/unload/'
FROM mytable
STORAGE_INTEGRATION = myint
FILE_FORMAT = (FORMAT_NAME = my_csv_format);
Related: Configuring an Integration for Google Cloud Storage
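If the unload should be driven from Composer, here is a sketch of a DAG that simply runs the COPY INTO statement above through the Snowflake provider (connection id, bucket, integration, and file format names are placeholders):

from datetime import datetime
from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

UNLOAD_SQL = """
COPY INTO 'gcs://mybucket/unload/'
FROM mytable
STORAGE_INTEGRATION = myint
FILE_FORMAT = (FORMAT_NAME = my_csv_format);
"""

with DAG("snowflake_unload_to_gcs", start_date=datetime(2021, 1, 1),
         schedule_interval="@weekly", catchup=False) as dag:
    # Snowflake writes the CSV files directly into the GCS bucket,
    # so no intermediate local file or GCSToGCSOperator is needed.
    unload = SnowflakeOperator(
        task_id="unload_to_gcs",
        snowflake_conn_id="snowflake_conn",
        sql=UNLOAD_SQL,
    )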
I am loading a CSV file into an Azure Blob Storage account. I would like a process to be triggered when a new file is added that takes the new CSV and uses BCP to load it into an Azure SQL database.
My idea is to have an Azure Data Factory pipeline that is event-triggered. However, I am stuck as to what to do next. Should an Azure Function be triggered that takes this CSV and uses BCP to load it into the DB? Can Azure Functions even use BCP?
I am using Python.
Please check the link below. Basically, you want to copy new files as well as modified files, and for that a single Copy Data activity is useful. Use an event-based trigger (when a file is created) instead of a scheduled one.
https://www.mssqltips.com/sqlservertip/6365/incremental-file-load-using-azure-data-factory/
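If you do end up in an Azure Function (or any Python process) instead of BCP, one common substitute is pyodbc with fast_executemany; a rough sketch with placeholder connection strings, container, and table names:

import csv
import io
import pyodbc
from azure.storage.blob import BlobClient

# Placeholder connection strings and names.
blob = BlobClient.from_connection_string(
    "<storage-connection-string>", container_name="incoming", blob_name="data.csv")
rows = list(csv.reader(io.StringIO(blob.download_blob().readall().decode("utf-8"))))

conn = pyodbc.connect("<azure-sql-odbc-connection-string>")
cursor = conn.cursor()
cursor.fast_executemany = True  # batch the parameter sets instead of row-by-row round trips
cursor.executemany("INSERT INTO dbo.MyTable (col1, col2) VALUES (?, ?)", rows[1:])  # skip header row
conn.commit()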
I have done ETL from MySQL to BigQuery with Python, but because I don't have permission to connect to Google Cloud Storage / Cloud SQL, I have to dump the data and partition it by last date. This way is easy but not worth it because it takes a lot of time. Is it possible to do ETL with Airflow from MySQL/Mongo to BigQuery without Google Cloud Storage / Cloud SQL?
With Airflow or not, the easiest and most efficient way is to (see the sketch after this list):
Extract data from data source
Load the data into a file
Drop the file into Cloud Storage
Run a BigQuery load job on these files (load jobs are free)
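A minimal sketch of steps 3 and 4 with the Cloud Storage and BigQuery client libraries (bucket, file, and table names are placeholders):

from google.cloud import bigquery, storage

# Placeholder names: adapt bucket, object path, and table id.
BUCKET = "my-staging-bucket"
TABLE_ID = "my-project.my_dataset.my_table"

# 3. Drop the extracted file into Cloud Storage.
storage.Client().bucket(BUCKET).blob("exports/export.csv").upload_from_filename("export.csv")

# 4. Run a BigQuery load job on the uploaded file (load jobs are free).
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
)
bigquery.Client().load_table_from_uri(
    f"gs://{BUCKET}/exports/export.csv", TABLE_ID, job_config=job_config
).result()  # wait for the load to finish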
If you want to avoid creating a file and dropping it into Cloud Storage, another, much more complex way is possible: stream the data into BigQuery (a short sketch follows the caveats below).
Run a query (MySQL or Mongo)
Fetch the result.
For each row, stream-write the result into BigQuery (streaming inserts are not free on BigQuery)
Described like this, it does not seem very complex, but:
You have to maintain the connection to the source and to the destination during the whole process
You have to handle errors (read and write) and be able to restart at the last point of failure
You have to perform bulk streaming writes into BigQuery to optimize performance; the chunk size has to be chosen wisely
Airflow bonus: you have to define and write your own custom operator to do this
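A rough sketch of the streaming path (chunked writes through insert_rows_json; the table name and chunk size are placeholders, and rows_iter stands for whatever the source query yields as dicts):

from google.cloud import bigquery

TABLE_ID = "my-project.my_dataset.my_table"  # placeholder
CHUNK_SIZE = 500  # choose wisely; bigger chunks mean fewer (billed) streaming calls

client = bigquery.Client()

def stream_rows(rows_iter):
    buffer = []
    for row in rows_iter:  # each row is a dict keyed by column name
        buffer.append(row)
        if len(buffer) >= CHUNK_SIZE:
            errors = client.insert_rows_json(TABLE_ID, buffer)  # streaming insert (not free)
            if errors:
                raise RuntimeError(errors)  # restart/retry logic would go here
            buffer = []
    if buffer:
        errors = client.insert_rows_json(TABLE_ID, buffer)
        if errors:
            raise RuntimeError(errors)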
By the way, I strongly recommend following the first solution.
Additional tip: BigQuery can now query a Cloud SQL database directly. If you still need your MySQL database (to keep some reference data in it), you can migrate it to Cloud SQL and perform a join between your BigQuery data warehouse and your Cloud SQL reference data.
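The "query Cloud SQL directly" tip refers to federated queries; a hedged sketch of such a join, with placeholder connection and table names:

from google.cloud import bigquery

# Placeholder connection resource and table names.
SQL = """
SELECT w.user_id, w.total, r.user_name
FROM `my-project.my_dataset.warehouse_table` AS w
JOIN EXTERNAL_QUERY(
  'projects/my-project/locations/us/connections/my-cloudsql-conn',
  'SELECT user_id, user_name FROM users;') AS r
USING (user_id)
"""
for row in bigquery.Client().query(SQL).result():
    print(dict(row.items()))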
It is indeed possible to synchronize MySQL databases to BigQuery with Airflow.
You would of course need to make sure you have properly authenticated connections configured for your Airflow DAG workflow.
Also, define which columns from MySQL you would like to pull and load into BigQuery. You also want to choose the loading method: should the data be loaded incrementally or fully? Be sure to formulate a technique for eliminating duplicate copies of data (de-duplication) as well.
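One common de-duplication technique for incremental loads, sketched with placeholder table and column names: load each incremental batch into a staging table, then MERGE it into the target.

from google.cloud import bigquery

# Placeholder tables; assumes the batch was already loaded into `staging`
# and that `id` is the key and `updated_at` the watermark column.
MERGE_SQL = """
MERGE `my-project.my_dataset.target` AS t
USING `my-project.my_dataset.staging` AS s
ON t.id = s.id
WHEN MATCHED AND s.updated_at > t.updated_at THEN
  UPDATE SET name = s.name, updated_at = s.updated_at
WHEN NOT MATCHED THEN
  INSERT ROW
"""
bigquery.Client().query(MERGE_SQL).result()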
You can find more information on this topic through this link:
How to Sync Mysql into Bigquery in realtime?
Here is a great resource for setting up your bigquery account and authentications:
https://www.youtube.com/watch?v=fAwWSxJpFQ8
You can also have a look at stitchdata.com (https://www.stitchdata.com/integrations/mysql/google-bigquery/)
The Stitch MySQL integration will ETL your MySQL to Google BigQuery in minutes and keep it up to date without having to constantly write and maintain ETL scripts. Google Cloud Storage or Cloud SQL won’t be necessary in this case.
For more information on aggregating data for BigQuery using Apache Airflow you may refer to the link below:
https://cloud.google.com/blog/products/gcp/how-to-aggregate-data-for-bigquery-using-apache-airflow
What is the best possible way to copy a table (with millions of rows) from one type of database to another using pandas or Python?
I have a table in a PostgreSQL database consisting of millions of rows, and I want to move it to Amazon Redshift. What would be the best possible way to achieve that using pandas or Python?
The AWS Database Migration Service (DMS) can handle this:
Using a PostgreSQL Database as a Source for AWS DMS - AWS Database Migration Service
Using an Amazon Redshift Database as a Target for AWS Database Migration Service - AWS Database Migration Service
Alternatively, if you wish to do it yourself (see the sketch after this list):
Export the data from PostgreSQL into CSV files (they can be gzip compressed)
Upload the files to Amazon S3
Create the destination tables in Amazon Redshift
Use the COPY command in Amazon Redshift to load the CSV files into Redshift
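A minimal sketch of steps 2 and 4 with boto3 and psycopg2 (bucket, table, endpoint, and IAM role are placeholders):

import boto3
import psycopg2

# 2. Upload the (gzip-compressed) CSV export to S3. Placeholder names throughout.
boto3.client("s3").upload_file("table_export.csv.gz", "my-staging-bucket",
                               "exports/table_export.csv.gz")

# 4. Run COPY on Redshift to load the file into the pre-created table.
conn = psycopg2.connect(host="<redshift-endpoint>", port=5439,
                        dbname="dev", user="awsuser", password="<password>")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY my_schema.my_table
        FROM 's3://my-staging-bucket/exports/table_export.csv.gz'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        FORMAT AS CSV GZIP;
    """)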
If you're using AWS services, it might be good to use AWS Glue; it uses Python scripts for its ETL operations and is well suited for, e.g., DynamoDB --> Redshift.
If you're not using only AWS services, try exporting your data as CSV (I did this for millions of rows) and create a migration tool using C# or another language to read the CSV file and insert the rows after converting them as needed. (Check whether the database technology you're using can ingest the CSV directly so you can avoid doing the migration yourself.)
I'm currently trying to send data to an Azure DocumentDB collection from Python (using the pydocumentdb library).
I have to send about 100,000 documents to this collection, and it takes a very long time (about 2 hours).
I send each document one by one using:
for document in documents:
    client.CreateDocument(collection_link, document)
Am I doing something wrong? Is there a faster way to do it, or is it just normal that it takes this long?
Thanks!
On Azure, there are many ways to import data into Cosmos DB faster than using the PyDocumentDB API, which wraps the related REST APIs over HTTP.
First, prepare a JSON file that includes the documents you want to import; then you can follow the documents below to import the data.
Refer to the document How to import data into Azure Cosmos DB for the DocumentDB API? to import the JSON data file via the DocumentDB Data Migration Tool.
Refer to the document Azure Cosmos DB: How to import MongoDB data? to import the JSON data file via MongoDB's mongoimport tool.
Upload the JSON data file to Azure Blob Storage, then copy the data from Blob Storage to Cosmos DB using Azure Data Factory; see the section Example: Copy data from Azure Blob to Azure Cosmos DB for more details.
If you just want to import the data programmatically, you can try using the Python MongoDB driver to connect to Azure Cosmos DB and import the data via the MongoDB wire protocol; please refer to the document Introduction to Azure Cosmos DB: API for MongoDB.
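A minimal sketch of that last option with pymongo, reusing the `documents` list from the question (the connection string, database, and collection names are placeholders, and the Cosmos DB account must use the API for MongoDB):

from pymongo import MongoClient

client = MongoClient("<cosmosdb-mongodb-connection-string>")  # placeholder connection string
collection = client["mydatabase"]["mycollection"]

# Insert in batches instead of one CreateDocument call per document.
BATCH = 1000
for i in range(0, len(documents), BATCH):
    collection.insert_many(documents[i:i + BATCH], ordered=False)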
Hope it helps.