I'm currently trying to send data to an Azure DocumentDB collection from Python (using the pydocumentdb library).
I have to send about 100,000 documents to this collection, and it takes a very long time (about 2 hours).
I send each document one by one using:
for document in documents:
    client.CreateDocument(collection_link, document)
Am I doing something wrong? Is there a faster way to do it, or is it just normal that it takes so long?
Thanks !
On Azure, there are several ways to import data into Cosmos DB faster than with the PyDocumentDB API, which wraps the related REST APIs over HTTP.
First, prepare a JSON file containing your 100,000 documents for import; then you can follow the documents below to import the data.
Refer to the document How to import data into Azure Cosmos DB for the DocumentDB API? to import the JSON data file via the DocumentDB Data Migration Tool.
Refer to the document Azure Cosmos DB: How to import MongoDB data? to import the JSON data file via MongoDB's mongoimport tool.
Upload the JSON data file to Azure Blob Storage, then copy the data from Blob Storage to Cosmos DB using Azure Data Factory; please see the section Example: Copy data from Azure Blob to Azure Cosmos DB for more details.
If you just want to import the data programmatically, you can try the Python MongoDB driver to connect to Azure Cosmos DB and import the data via the MongoDB wire protocol; please refer to the document Introduction to Azure Cosmos DB: API for MongoDB.
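For what it's worth, here is a minimal sketch of that last option with pymongo, assuming your Cosmos DB account has the MongoDB API enabled; the connection string, database and collection names are placeholders:

import pymongo

# Connection string from the Azure portal's "Connection String" blade -- placeholder values
uri = "mongodb://<account>:<key>@<account>.documents.azure.com:10255/?ssl=true&replicaSet=globaldb"
client = pymongo.MongoClient(uri)
collection = client["mydb"]["mycollection"]  # hypothetical database / collection names

# "documents" is the same list you were looping over with CreateDocument;
# insert_many sends them in batches instead of one HTTP call per document.
collection.insert_many(documents)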
Hope it helps.
I want to set up a backup system for my Firestore databases where, every few weeks, the data gets saved to a GCP bucket and all the collections get deleted.
In the documentation I found a gcloud command that lets me export the data from Firestore.
gcloud firestore export gs://[BUCKET_NAME]
I know I can import it back into Firestore or send the data to BigQuery, but is there a way I can just load the exported data into Python as CSV, JSON, or some other format without needing to use another tool?
I am new to GCP as well as Python.
I am trying to read CSV files present in Google Cloud Storage and write the data into a Cloud SQL table using Python. Can anyone help with that? Any help will be appreciated.
Thanks in advance
You shouldn't read and load the data yourself if you have no updates or cleaning to perform on it. You can use Cloud SQL's capability to load a CSV directly from Cloud Storage. It also works with PostgreSQL (change the database engine at the top of that page).
Do you need a code example for calling a REST API in Python? (It's quite basic these days, but the authentication part might annoy you!)
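For example, here is a rough sketch of triggering a Cloud SQL import of a CSV from Cloud Storage via the Admin API's REST endpoint; the instance, database, bucket and table names are placeholders, and I'm assuming application default credentials are available:

import google.auth
import google.auth.transport.requests
import requests

# Get application default credentials with a cloud-platform scope
credentials, project = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(google.auth.transport.requests.Request())

instance = "my-instance"  # hypothetical Cloud SQL instance name
url = ("https://sqladmin.googleapis.com/sql/v1beta4/projects/"
       f"{project}/instances/{instance}/import")

body = {
    "importContext": {
        "fileType": "CSV",
        "uri": "gs://my-bucket/my-file.csv",        # hypothetical CSV object
        "database": "my-database",                  # hypothetical database
        "csvImportOptions": {"table": "my_table"},  # hypothetical target table
    }
}

resp = requests.post(url, json=body,
                     headers={"Authorization": f"Bearer {credentials.token}"})
resp.raise_for_status()
print(resp.json())  # an Operation resource you can poll until the import completes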
I like to use google-cloud-storage when operating with GCS. The package is basically a wrapper for GCloud's API.
Here's how you might use this library:
import os

from google.cloud import storage

# create the client
path_to_service_account = "path/foo.json"
client = storage.Client.from_service_account_json(path_to_service_account)

# get the bucket (lookup_bucket returns None if it doesn't exist)
bucket_name = "my-bucket"
bucket = client.lookup_bucket(bucket_name)

# loop through the bucket to get the resource you want
for blob in bucket.list_blobs(prefix="dir-name/"):
    # file type doesn't need to be a csv...
    if blob.name.endswith("my_file.csv"):
        blob_filename = blob.name.split("/")[-1]
        blob.download_to_filename(os.path.join("save-dir", blob_filename))

# finally, load the file from local storage:
...
I have done ETL from MySQL to BigQuery with Python, but because I don't have permission to use Google Cloud Storage / Cloud SQL, I must dump the data and partition it by last date. This approach is easy but not worth it because it takes too much time. Is it possible to ETL from MySQL/Mongo to BigQuery using Airflow without Google Cloud Storage / Cloud SQL?
With Airflow or not, the easiest and most efficient way is to:
Extract data from data source
Load the data into a file
Drop the file into Cloud Storage
Run a BigQuery load job on these files (load jobs are free); a sketch of this step follows the list.
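As a minimal sketch of that last step with the google-cloud-bigquery client (the bucket path, dataset and table names here are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,                  # or pass an explicit schema
    write_disposition="WRITE_APPEND",
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/export/*.json",    # hypothetical export path
    "my_project.my_dataset.my_table",  # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for the (free) load job to finish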
If you want to avoid creating a file and dropping it into Cloud Storage, another, much more complex, way is possible: stream the data into BigQuery.
Run a query (MySQL or Mongo)
Fetch the result.
For each row, stream-write the result into BigQuery (streaming is not free on BigQuery); a sketch follows the caveats below.
Described like this, it does not seem very complex, but:
You have to maintain the connection to the source and to the destination during the whole process
You have to handle errors (read and write) and be able to restart at the last point of failure
You have to perform bulk streaming writes into BigQuery to optimize performance; the chunk size has to be chosen wisely
Airflow bonus: you have to define and write your own custom operator to do this
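Here is a hedged sketch of that streaming write with insert_rows_json, batching rows into chunks; the table name and chunk size are assumptions:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.my_table"  # hypothetical destination table

def stream_rows(rows, chunk_size=500):
    """Stream dict rows into BigQuery in bulk chunks (streaming inserts are billed)."""
    for i in range(0, len(rows), chunk_size):
        errors = client.insert_rows_json(table_id, rows[i:i + chunk_size])
        if errors:
            # you still have to decide how to retry / restart from this chunk
            raise RuntimeError("insert failed at chunk %s: %s" % (i, errors))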
By the way, I strongly recommend following the first solution.
Additional tip: BigQuery can now query a Cloud SQL database directly. If you still need your MySQL database (to keep some referential data in it), you can migrate it to Cloud SQL and perform a join between your BigQuery data warehouse and your Cloud SQL referential.
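If it helps, here is a rough sketch of such a federated join with EXTERNAL_QUERY; the connection ID and table names are placeholders and assume a BigQuery-to-Cloud SQL connection has already been created:

from google.cloud import bigquery

client = bigquery.Client()

# Join warehouse data in BigQuery with a referential table living in Cloud SQL
sql = """
SELECT w.*, ref.label
FROM `my_project.my_dataset.warehouse_table` AS w
JOIN EXTERNAL_QUERY(
       'my_project.us.my_cloudsql_connection',
       'SELECT id, label FROM referential_table') AS ref
  ON w.ref_id = ref.id
"""
for row in client.query(sql).result():
    print(dict(row))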
It is indeed possible to synchronize MySQL databases to BigQuery with Airflow.
You would, of course, need to make sure you have properly authenticated connections in your Airflow DAG workflow.
Also, make sure to define which columns from MySQL you would like to pull and load into BigQuery. You also want to choose a method for loading your data: do you want it loaded incrementally or fully? Be sure to formulate a technique for eliminating duplicate copies of the data (de-duplication) as well.
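One hedged way to handle that de-duplication after an incremental load is a MERGE from a staging table into the final table on a key column; the project, dataset, table and column names below are assumptions:

from google.cloud import bigquery

client = bigquery.Client()

# Upsert the freshly loaded staging rows into the final table, keyed on "id",
# so re-running an incremental load does not create duplicate rows.
merge_sql = """
MERGE `my_project.my_dataset.orders` AS target
USING `my_project.my_dataset.orders_staging` AS source
ON target.id = source.id
WHEN MATCHED THEN
  UPDATE SET target.status = source.status, target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (id, status, updated_at)
  VALUES (source.id, source.status, source.updated_at)
"""
client.query(merge_sql).result()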
You can find more information on this topic through this link:
How to Sync Mysql into Bigquery in realtime?
Here is a great resource for setting up your BigQuery account and authentication:
https://www.youtube.com/watch?v=fAwWSxJpFQ8
You can also have a look at stitchdata.com (https://www.stitchdata.com/integrations/mysql/google-bigquery/)
The Stitch MySQL integration will ETL your MySQL to Google BigQuery in minutes and keep it up to date without having to constantly write and maintain ETL scripts. Google Cloud Storage or Cloud SQL won’t be necessary in this case.
For more information on aggregating data for BigQuery using Apache Airflow you may refer to the link below:
https://cloud.google.com/blog/products/gcp/how-to-aggregate-data-for-bigquery-using-apache-airflow
I have millions (about 200 million) of records, which I need to insert into my CosmosDB.
I've just discovered that Microsoft failed to implement batch insert capability... (1), (2) <--- 3 years to reply.
Rather than throw myself off a Microsoft building, I have started to look for alternative solutions.
One idea I had was to write all my documents to a file and then import that file into my DB.
Now, how, once I have created my JSON file containing my documents, can I do the bulk import (from Python 3.6)?
I've come across some migration tool, but I was wondering if there is a better/quicker way that doesn't require me to install this tool... You see, I will be running my code in a WebJob, so installing the migration tool may not be an option anyway.
I suggest you use the MongoDB driver, which is supported by Azure DocumentDB.
The MongoDB driver works with a binary protocol, while the Azure DocumentDB SDK works over HTTP.
As we know, a binary protocol is more efficient than HTTP.
You could use the bulk methods to import data into Azure Cosmos DB; each group of operations can have at most 1,000 operations.
Please refer to the document here.
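As a minimal sketch of such batched bulk writes with pymongo against the Cosmos DB MongoDB API (the connection string, database and collection names are placeholders):

from pymongo import MongoClient, InsertOne

# Placeholder connection string from the Azure portal
uri = "mongodb://<account>:<key>@<account>.documents.azure.com:10255/?ssl=true&replicaSet=globaldb"
collection = MongoClient(uri)["mydb"]["mycollection"]  # hypothetical names

def bulk_import(documents, batch_size=1000):
    """Send documents in groups of at most 1,000 operations per bulk call."""
    for i in range(0, len(documents), batch_size):
        batch = [InsertOne(doc) for doc in documents[i:i + batch_size]]
        collection.bulk_write(batch, ordered=False)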
Notes:
The maximum size for a document in Azure DocumentDB is 2 MB, as mentioned here.
We are developing a project to process our log data. The idea is to:
upload log data from local Logstash to Google Cloud Storage
write a Python script to insert a job that imports the log data from Google Cloud Storage into Google BigQuery
write a Python script to process the data in BigQuery itself
Note: for the Python scripts, we are deciding whether to run them on Google App Engine or Google Compute Engine.
The questions are
Is this a practical solution?
The structure of the log data changes quite often, and this will cause errors when inserting into BigQuery. How should we handle that in the Python script?
In case we have to rerun the log data for a particular period, how can we do that? Do we need to write a Python script?
Thanks
There is a new API for streaming data directly into BigQuery which may be a better match for your use case.
Instead of using a job to load data into BigQuery, you can choose to stream your data into BigQuery one record at a time by using the tabledata().insertAll() method. This approach enables querying data without the delay of running a load job. There are several important trade-offs to consider before choosing an approach.
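A hedged sketch of that call via the discovery-based client follows; the project, dataset and table IDs are placeholders, and I'm assuming application default credentials are set up:

from googleapiclient.discovery import build

# build() falls back to application default credentials when none are passed
service = build("bigquery", "v2")

body = {
    "rows": [
        {
            "insertId": "unique-id-1",  # lets BigQuery de-duplicate retried rows
            "json": {"timestamp": "2014-01-01T00:00:00Z",
                     "level": "INFO",
                     "message": "hello"},
        }
    ]
}

response = service.tabledata().insertAll(
    projectId="example-project",   # hypothetical project
    datasetId="logs_dataset",      # hypothetical dataset
    tableId="logs_table",          # hypothetical table
    body=body,
).execute()
print(response.get("insertErrors", []))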
If the structure of your data changes, you can have BigQuery run over its tables and update them accordingly. Streaming the raw data gives you the most flexibility, but at the higher cost of having to post-process the data again.
There is the streaming data solution that someone has already mentioned, but if you're trying to move a large block of logs data rather than set up a continuous stream, you may want to take the route of using asynchronous load jobs instead.
The GCS library acts like most Python file libraries when used in Google App Engine, and can store files for import in Cloud Storage buckets:
import cloudstorage as gcs

filePath = "/CloudStorageBucket/dir/dir/logs.json"
with gcs.open(filePath, "w") as f:
    f.write(SomeLogData)  # the with block closes the file automatically
You can instruct BigQuery to load a list of CSV or newline-delimited JSON files in Cloud Storage by creating load jobs via the API (note: you will need to use OAuth 2):
from apiclient.discovery import build

service = build("bigquery", "v2", http=oAuthedHttp)

job = {
    "configuration": {
        "load": {
            "sourceUris": ["gs://CloudStorageBucket/dir/dir/logs.json"],
            "schema": {
                "fields": [
                    {"name": "Column1",
                     "type": "STRING"},
                    ...
                ]
            },
            "destinationTable": {
                "projectId": "Example-BigQuery-ProjectId",
                "datasetId": "LogsDataset",
                "tableId": "LogsTable"
            },
            "sourceFormat": "NEWLINE_DELIMITED_JSON",
            "createDisposition": "CREATE_IF_NEEDED"
        }
    }
}

response = service.jobs().insert(
    projectId="Example-BigQuery-ProjectId",
    body=job
).execute()
You can read more about how to create BigQuery load jobs if you want to set other properties like the write disposition or skipping rows in a CSV file. You can also see other good examples of how to load data, including command-line prompts.
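For example (hedged; reusing the same job dict as above), a CSV load that overwrites the table and skips a header row would add:

# Additional JobConfigurationLoad properties on the same job dict as above
job["configuration"]["load"].update({
    "sourceFormat": "CSV",
    "writeDisposition": "WRITE_TRUNCATE",  # replace the table contents
    "skipLeadingRows": 1,                  # ignore the CSV header line
})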
Edit:
To answer your more specific questions:
Is this a practical solution?
Yes. We export our Google App Engine logs to Cloud Storage and import them into BigQuery using deferred tasks. Some have used MapReduce jobs, but this can be overkill if you don't need to shuffle or reduce.
The structure of the log data changes quite often, and this will cause errors when inserting into BigQuery. How should we handle that in the Python script?
It shouldn't be an issue unless you're parsing the messages before they reach BigQuery. A better design would be to port the messages, timestamps, levels, etc. to BigQuery and then digest them with queries there.
In case we have to rerun the log data for a particular period, how can we do that? Do we need to write a Python script?
Streaming the data won't give you backups unless you set them up yourself in BigQuery. Using the method I outlined above automatically gives you backups in Google Cloud Storage, which is preferred.
Know that BigQuery is an OLAP database, not a transactional one, so it's typically best to rebuild tables each time you add more log data rather than try to insert new data. It's counter-intuitive, but BigQuery is designed for this; it can import 10,000 files / 1 TB at a time. Using pagination with the job write disposition, you can in theory import hundreds of thousands of records fairly quickly. Streaming the data would be ideal if you don't care about having backup logs.