I have millions of records (about 200 million) that I need to insert into my Cosmos DB.
I've just discovered that Microsoft failed to implement batch insert capability... (1), (2) <--- 3 years to reply.
Rather than throw myself off a Microsoft building, I have started to look for alternative solutions.
One idea I had was to write all my documents to a file and then import that file into my DB.
Now, once I have created my JSON file containing my documents, how can I do the bulk import (from Python 3.6)?
I've come across a migration tool, but I was wondering if there is a better/quicker way that doesn't require installing it... You see, I will be running my code in a WebJob, so installing the migration tool may not be an option anyway.
I suggest you use the MongoDB driver, which is supported by Azure Document DB.
The MongoDB driver works with a binary protocol, while the Azure Document DB SDK works over HTTP.
As we know, a binary protocol is more efficient than HTTP.
You could use the bulk write method to import data into Azure Cosmos DB; each group of operations can contain at most 1,000 operations.
Please refer to the document here.
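To make that concrete, here is a minimal sketch of a batched import from Python, assuming you connect through Cosmos DB's MongoDB API with pymongo; the connection string, database and collection names are placeholders you'd replace with your own.

```python
# Rough sketch only: batched inserts through the MongoDB API of Cosmos DB.
# The connection string, database and collection names are placeholders.
from pymongo import MongoClient, InsertOne

client = MongoClient("<your-cosmosdb-mongodb-connection-string>")
collection = client["mydb"]["mycollection"]

BATCH_SIZE = 1000  # at most 1,000 operations per group


def insert_in_batches(documents):
    """Send documents to the collection in groups of up to BATCH_SIZE operations."""
    batch = []
    for doc in documents:
        batch.append(InsertOne(doc))
        if len(batch) == BATCH_SIZE:
            collection.bulk_write(batch, ordered=False)
            batch = []
    if batch:
        collection.bulk_write(batch, ordered=False)
```

You can feed insert_in_batches with a generator that streams documents out of your JSON file, so the 200 million records never have to sit in memory at once.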
Notes:
The maximum size for a document in Azure Document DB is 2 MB, as mentioned here.
I have done ETL from MySQL to BigQuery with Python, but because I don't have permission to connect to Google Cloud Storage / Cloud SQL, I have to dump the data and partition it by last date. This way is easy but not worth it, because it takes too much time. Is it possible to do ETL with Airflow from MySQL/Mongo to BigQuery without Google Cloud Storage / Cloud SQL?
With Airflow or not, the easiest and most efficient way is to:
Extract data from data source
Load the data into a file
Drop the file into Cloud Storage
Run a BigQuery load job on these files (load jobs are free; a sketch follows this list)
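As an illustration of that last step, here is a minimal sketch using the google-cloud-bigquery client. It assumes the files are newline-delimited JSON; the bucket path, dataset and table names are placeholders.

```python
# Minimal sketch of the load-job step. Bucket, dataset and table are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # let BigQuery infer the schema from the files
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/exports/*.json",      # the files dropped into Cloud Storage
    "my_project.my_dataset.my_table",
    job_config=job_config,
)
load_job.result()  # wait for completion; the load job itself is free
```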
If you want to avoid creating a file and dropping it into Cloud Storage, another way is possible, but it is much more complex: stream the data into BigQuery.
Run a query (MySQL or Mongo)
Fetch the result.
For each row, stream-write the result into BigQuery (streaming is not free on BigQuery)
Described like this, it does not seem very complex, but:
You have to maintain the connection to the source and to the destination for the whole process
You have to handle errors (read and write) and be able to restart from the last point of failure
You have to perform bulk stream writes into BigQuery to optimize performance, and the chunk size has to be chosen wisely (see the sketch after this list)
Airflow bonus: you have to define and write your own custom operator to do this.
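For reference, here is a rough sketch of that streaming path, assuming PyMySQL as the MySQL driver; the connection settings, query and target table are placeholders, and the restart-on-failure logic mentioned above is deliberately left out.

```python
# Rough sketch of the streaming alternative: read rows from MySQL and
# stream-write them into BigQuery in chunks. No retry/restart logic here.
import pymysql
from google.cloud import bigquery

CHUNK_SIZE = 500  # choose the chunk size wisely for your row size

bq_client = bigquery.Client()
table_id = "my_project.my_dataset.my_table"   # placeholder target table

conn = pymysql.connect(host="mysql-host", user="user", password="secret",
                       database="mydb", cursorclass=pymysql.cursors.DictCursor)


def flush(chunk):
    errors = bq_client.insert_rows_json(table_id, chunk)  # streaming insert (not free)
    if errors:
        raise RuntimeError(errors)


with conn.cursor() as cursor:
    cursor.execute("SELECT id, name FROM my_table")        # placeholder query
    chunk = []
    for row in cursor:
        chunk.append(row)
        if len(chunk) == CHUNK_SIZE:
            flush(chunk)
            chunk = []
    if chunk:
        flush(chunk)
```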
By the way, I strongly recommend following the first solution.
Additional tip: BigQuery can now query directly into a Cloud SQL database (federated queries). If you still need your MySQL database (for example, to keep some reference data in it), you can migrate it to Cloud SQL and perform a join between your BigQuery data warehouse and your Cloud SQL reference tables.
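A hypothetical example of that join, assuming you have already created a Cloud SQL connection resource in BigQuery; the connection ID, dataset and column names below are made up.

```python
# Hypothetical federated-query sketch: join warehouse data in BigQuery with a
# reference table living in Cloud SQL (MySQL) through EXTERNAL_QUERY.
from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT w.order_id, w.amount, ref.customer_name
FROM `my_project.my_dataset.orders` AS w
JOIN EXTERNAL_QUERY(
  'my_project.us.my_cloudsql_connection',        -- Cloud SQL connection resource
  'SELECT id, customer_name FROM customers'      -- runs inside Cloud SQL
) AS ref
ON w.customer_id = ref.id
"""

for row in client.query(sql).result():
    print(row.order_id, row.customer_name)
```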
It is indeed possible to synchronize MySQL databases to BigQuery with Airflow.
You would of course need to make sure you have properly authenticated connections configured for your Airflow DAG workflow.
Also, make sure to define which columns from MySQL you would like to pull and load into BigQuery. You also want to choose the method for loading your data: incrementally or fully? Be sure to formulate a technique for eliminating duplicate copies of data (de-duplication) as well.
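As a rough sketch of what such a DAG could look like, here is one using the transfer operators from the apache-airflow-providers-google package (which stage the data through GCS, as the other answer recommends); the connection IDs, bucket, dataset and column list are placeholders.

```python
# Hypothetical Airflow DAG: MySQL -> GCS -> BigQuery. Names and connections
# below are placeholders; adjust the SQL to pull only the columns you need.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.transfers.mysql_to_gcs import MySQLToGCSOperator
from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator

with DAG(
    dag_id="mysql_to_bigquery",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    extract = MySQLToGCSOperator(
        task_id="mysql_to_gcs",
        mysql_conn_id="my_mysql",
        sql="SELECT id, name, updated_at FROM my_table",
        bucket="my-staging-bucket",
        filename="exports/my_table_{{ ds }}.json",
        export_format="json",
    )

    load = GCSToBigQueryOperator(
        task_id="gcs_to_bq",
        bucket="my-staging-bucket",
        source_objects=["exports/my_table_{{ ds }}.json"],
        destination_project_dataset_table="my_project.my_dataset.my_table",
        source_format="NEWLINE_DELIMITED_JSON",
        write_disposition="WRITE_APPEND",
        autodetect=True,
    )

    extract >> load
```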
You can find more information on this topic through this link:
How to Sync Mysql into Bigquery in realtime?
Here is a great resource for setting up your BigQuery account and authentication:
https://www.youtube.com/watch?v=fAwWSxJpFQ8
You can also have a look at stitchdata.com (https://www.stitchdata.com/integrations/mysql/google-bigquery/).
The Stitch MySQL integration will ETL your MySQL data to Google BigQuery in minutes and keep it up to date without you having to constantly write and maintain ETL scripts. Google Cloud Storage or Cloud SQL won't be necessary in this case.
For more information on aggregating data for BigQuery using Apache Airflow you may refer to the link below:
https://cloud.google.com/blog/products/gcp/how-to-aggregate-data-for-bigquery-using-apache-airflow
I have about 100 million json files (10 TB), each with a particular field containing a bunch of text, for which I would like to perform a simple substring search and return the filenames of all the relevant json files. They're all currently stored on Google Cloud Storage. Normally for a smaller number of files I might just spin up a VM with many CPUs and run multiprocessing via Python, but alas this is a bit too much.
I want to avoid spending too much time setting up infrastructure like a Hadoop server, or loading all of that into some MongoDB database. My question is: what would be a quick and dirty way to perform this task? My original thoughts were to set up something on Kubernetes with some parallel processing running Python scripts, but I'm open to suggestions and don't really have a clue how to go about this.
Easier would be to just load the GCS data into BigQuery and run your query from there.
Send your data to AWS S3 and use Amazon Athena.
The Kubernetes option would be to set up a cluster in GKE, install Presto on it with a lot of workers, use a Hive metastore with GCS, and query from there (Presto doesn't have a direct GCS connector yet, as far as I know). This option seems more elaborate.
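As a rough sketch of the first (BigQuery) option, assuming each JSON file holds one object per line: you can define an external table over the GCS files instead of loading them, which keeps the source file name available through the _FILE_NAME pseudo-column. The bucket, dataset, table and field names below are placeholders.

```python
# Rough sketch: query the JSON files in place via a BigQuery external table
# and return the names of files whose text field contains the substring.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("NEWLINE_DELIMITED_JSON")
external_config.source_uris = ["gs://my-bucket/docs/*.json"]
external_config.autodetect = True

table = bigquery.Table("my_project.my_dataset.docs_external")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

query = """
SELECT DISTINCT _FILE_NAME AS filename     -- pseudo-column on external tables
FROM `my_project.my_dataset.docs_external`
WHERE text_field LIKE '%my substring%'
"""
for row in client.query(query).result():
    print(row.filename)
```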
Hope it helps!
I want to call PostgreSQL queries and return the results to Python APIs.
Basically, I want to do Python and PostgreSQL integration/connectivity.
So, for specific Python API calls, I want to execute the queries and return the results.
Also, I want to achieve abstraction of the PostgreSQL DB.
Thanks.
To add to klin's comment:
psycopg2 -
This is the most popular PostgreSQL adapter for Python. It was built to handle heavily concurrent use of PostgreSQL databases. Several extensions are available for added functionality on top of the DB API.
asyncpg -
A more recent PostgreSQL adapter which seeks to address shortfalls in functionality and performance that exist in psycopg2. It roughly doubles the speed of psycopg2's text-based data exchange protocol by using binary I/O (which also adds generic support for container types). A major plus is that it has zero dependencies. I have no personal experience with this adapter but will test it soon.
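A minimal sketch of the psycopg2 route; the connection parameters, table and query are placeholders. Wrapping the query in a function like this also gives you a small layer of abstraction over the PostgreSQL DB, which you mentioned wanting.

```python
# Minimal psycopg2 sketch: run a parameterized query and return the rows.
import psycopg2
from psycopg2.extras import RealDictCursor


def fetch_users(min_age):
    conn = psycopg2.connect(host="localhost", dbname="mydb",
                            user="me", password="secret")
    try:
        with conn.cursor(cursor_factory=RealDictCursor) as cur:
            cur.execute("SELECT id, name FROM users WHERE age >= %s", (min_age,))
            rows = cur.fetchall()   # list of dict-like rows
        conn.commit()
        return rows
    finally:
        conn.close()


if __name__ == "__main__":
    print(fetch_users(18))
```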
I have a running GAE app that has been collecting data for a while. I am now at the point where I need to run some basic reports on this data and would like to download a subset of the live data to my dev server. Downloading all entities of a kind will simply be too big a data set for the dev server.
Does anyone know of a way to download a subset of entities from a particular kind? Ideally it would be based on entity attributes like date or client ID, etc., but any method would work. I've even tried a regular full download and then arbitrarily killing the process when I thought I had enough data, but it seems the data is locked up in the .sql3 files generated by the bulkloader.
It looks like the default download/upload utilities for the GAE datastore (appcfg.py and bulkloader.py) don't support filtering.
It seems reasonable to do one of two things:
write a utility (select + export + save to a local file) and execute it locally, accessing the remote GAE datastore via the remote API shell (a sketch follows below)
write an admin web handler for select + export + zip: add a new URL to a handler, upload it to GAE, and call it over HTTP
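For the first option, here is a rough idea of what the select + export part could look like when run inside remote_api_shell.py (which already wires up the remote Datastore connection for you). The kind, its properties and the cutoff date are made up; substitute your own.

```python
# Hypothetical export script, meant to be pasted into remote_api_shell.py.
# The model, its properties and the filter below are placeholders.
import datetime
import json

from google.appengine.ext import ndb


class Record(ndb.Expando):                # Expando tolerates undeclared properties
    created = ndb.DateTimeProperty()      # the attribute we filter on


cutoff = datetime.datetime(2023, 1, 1)
query = Record.query(Record.created >= cutoff)

with open("subset.json", "w") as out:
    for entity in query.iter(batch_size=500):
        doc = entity.to_dict()
        doc["__key__"] = entity.key.urlsafe()
        out.write(json.dumps(doc, default=str) + "\n")
```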
I am working on a project that involves inserting a lot of data into the database. I am wondering if anybody knows how to fill 2 or 3 tables in the database at the same time. An example or pseudocode would be helpful.
Thanks
If you have a lot of data to insert into the database all at once, then you are probably interested in bulk loading data. The ideal tool for that is the bulk loader that likely comes with your database: Oracle, Microsoft SQL Server, Sybase SQL Server, and MySQL (to name the ones that come to mind) all have bulk loaders. For example, Microsoft has the BULK INSERT statement and the bcp program to perform this task. I recommend you look into that rather than rigging up some tool in Python, with or without threads.
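If it helps to see the SQL Server flavour concretely, here is a hypothetical sketch that issues one BULK INSERT per target table through pyodbc. The server, credentials, table names and file paths are all placeholders, and the data files must be readable by the SQL Server instance itself.

```python
# Hypothetical sketch: load three tables with one BULK INSERT statement each.
# Server, credentials, tables and file paths are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;UID=me;PWD=secret"
)
cursor = conn.cursor()

# One BULK INSERT per table; each load is independent of the others.
for table, path in [("dbo.Orders", r"C:\data\orders.csv"),
                    ("dbo.Customers", r"C:\data\customers.csv"),
                    ("dbo.Items", r"C:\data\items.csv")]:
    cursor.execute(
        f"BULK INSERT {table} FROM '{path}' "
        "WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\\n', FIRSTROW = 2)"
    )
    conn.commit()

conn.close()
```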