What is an efficient way to insert streaming records into MySQL from Google Dataflow using Python? Is there an I/O connector, as in the case of BigQuery? I see that BigQuery has beam.io.WriteToBigQuery; how can we use a similar I/O connector with Cloud SQL (MySQL)?
You can use JdbcIO to read and write data from/to any JDBC-compliant database.
You can find the details here: testWrite
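For the Python SDK specifically, JdbcIO is exposed through the cross-language transform apache_beam.io.jdbc.WriteToJdbc (backed by the Java JdbcIO via an expansion service). Here is a minimal sketch, assuming a MySQL table with two columns; the host, database, table name and credentials are placeholders:

import typing

import apache_beam as beam
from apache_beam import coders
from apache_beam.io.jdbc import WriteToJdbc

# Hypothetical row schema matching the target table.
ExampleRow = typing.NamedTuple('ExampleRow', [('id', int), ('name', str)])
coders.registry.register_coder(ExampleRow, coders.RowCoder)

with beam.Pipeline() as p:
    (
        p
        | 'CreateRows' >> beam.Create([ExampleRow(1, 'abc')])
        | 'WriteToMySQL' >> WriteToJdbc(
            table_name='example_table',            # placeholder table
            driver_class_name='com.mysql.cj.jdbc.Driver',
            jdbc_url='jdbc:mysql://HOST:3306/DB',  # Cloud SQL IP or proxy address
            username='user',
            password='password',
        )
    )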
I want to import data from a DynamoDB table into SQL Server.
I use Python boto3.
Basically, you need to use pymssql:
A simple database interface for Python that builds on top of FreeTDS
to provide a Python DB-API (PEP-249) interface to Microsoft SQL
Server.
You create a connection:
import pymssql

conn = pymssql.connect(server, user, password, "tempdb")
cursor = conn.cursor(as_dict=True)
Then you can use execute or executemany to build an INSERT statement.
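For example, a minimal sketch with executemany; the table name, column names and the dynamodb_items variable are placeholders for the data you pulled with boto3:

# Tuples of (id, name) built from the DynamoDB items fetched with boto3.
rows = [(item['id'], item['name']) for item in dynamodb_items]

cursor.executemany(
    "INSERT INTO example_table (id, name) VALUES (%s, %s)",
    rows)
conn.commit()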
It would be better to save this data to a CSV file and then use the BULK INSERT statement, as it will be faster when you are working with a large amount of data.
Both are supposed to parse a connection string and be able to insert into, say, SQL Server from a pandas DataFrame.
What is the real difference here?
PyODBC allows you to connect to and use an ODBC database through the standard DB API 2.0. SQLAlchemy is a toolkit that resides one level higher than that and provides a variety of features:
Object-relational mapping (ORM)
Query construction
Caching
Eager loading
and others. It can work with PyODBC or any other driver that supports DB API 2.0.
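To make the relationship concrete, here is a minimal sketch where SQLAlchemy builds the engine but delegates the actual ODBC work to pyodbc; the DSN, credentials and table name are placeholders:

import pandas as pd
import sqlalchemy as sa

# SQLAlchemy parses the URL and hands the connection work to pyodbc.
engine = sa.create_engine("mssql+pyodbc://user:password@my_dsn")  # 'my_dsn' is an ODBC DSN placeholder

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})
df.to_sql("example_table", engine, if_exists="append", index=False)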
I want to load data from our cloud environment (Pivotal Cloud Foundry) into SQL Server. Data is fetched from an API and held in memory, and we use pytds to insert data into SQL Server, but the only way I see in the documentation to do a bulk load is to load from a file. I cannot use pyodbc because we don't have an ODBC connection in the cloud environment.
How can I do a bulk insert directly from a dictionary?
pytds does not offer bulk load directly, only from a file.
The first thing that comes to mind is to convert the data into a bulk INSERT SQL statement, similar to how you would migrate MySQL.
Or, if you can export the data to CSV, you can import it using SSMS (SQL Server Management Studio).
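If you want to stay inside pytds and avoid the file-based bulk load, a minimal sketch is to send the in-memory records with executemany; the server, credentials and table name are placeholders, and this is a plain batched INSERT rather than a true bulk load:

import pytds

records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]  # data already held in memory

conn = pytds.connect(dsn='HOST', database='DB', user='user', password='password')
cur = conn.cursor()
cur.executemany(
    "INSERT INTO example_table (id, name) VALUES (%(id)s, %(name)s)",
    records)
conn.commit()
conn.close()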
I have done ETL from MySQL to BigQuery with Python, but because I don't have permission to connect to Google Cloud Storage / Cloud SQL, I have to dump the data and partition it by last date. This approach is easy but not worth it because it takes too much time. Is it possible to do ETL with Airflow from MySQL/Mongo to BigQuery without Google Cloud Storage / Cloud SQL?
With Airflow or not, the easiest and most efficient way is to:
Extract data from data source
Load the data into a file
Drop the file into Cloud Storage
Run a BigQuery load job on these files (load jobs are free); see the sketch after this list
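A minimal sketch of the load-job step with the google-cloud-bigquery client; the bucket, file pattern and table ID are placeholders:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True)

load_job = client.load_table_from_uri(
    "gs://my-bucket/export/part-*.csv",   # files dropped into Cloud Storage
    "my_project.my_dataset.my_table",     # destination table
    job_config=job_config)
load_job.result()  # wait for the load job to finish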
If you want to avoid creating a file and dropping it into Cloud Storage, another way is possible, but it is much more complex: stream the data into BigQuery.
Run a query (MySQL or Mongo)
Fetch the result.
For each row, stream-write the result into BigQuery (streaming inserts are not free in BigQuery); see the sketch after this list
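The streaming call itself is small; a minimal sketch with insert_rows_json, where the table ID and row contents are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
table_id = "my_project.my_dataset.my_table"  # placeholder

rows = [{"id": 1, "name": "a"}]  # one chunk of rows fetched from MySQL/Mongo
errors = client.insert_rows_json(table_id, rows)  # streaming insert (billed)
if errors:
    raise RuntimeError(errors)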
Described like this, it does not seem very complex, but:
You have to maintain the connection to the source and to the destination during the whole process
You have to handle errors (read and write) and be able to restart at the last point of failure
You have to perform bulk stream writes into BigQuery to optimize performance; the chunk size has to be chosen wisely
Airflow bonus: you have to define and write your own custom operator to do this (a skeleton is sketched below)
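A skeleton of such a custom operator could look like the following; the operator name, the hard-coded column names and the chunk size of 500 are all assumptions, not a reference implementation:

from airflow.models.baseoperator import BaseOperator
from airflow.providers.mysql.hooks.mysql import MySqlHook
from google.cloud import bigquery


class MySqlToBigQueryStreamOperator(BaseOperator):
    """Hypothetical operator: read rows from MySQL and stream-write them to BigQuery."""

    def __init__(self, mysql_conn_id, bq_table, sql, **kwargs):
        super().__init__(**kwargs)
        self.mysql_conn_id = mysql_conn_id
        self.bq_table = bq_table
        self.sql = sql

    def execute(self, context):
        rows = MySqlHook(mysql_conn_id=self.mysql_conn_id).get_records(self.sql)
        client = bigquery.Client()
        # Stream in chunks so each request stays small (chunk size chosen arbitrarily here).
        for start in range(0, len(rows), 500):
            chunk = [dict(zip(("id", "name"), r)) for r in rows[start:start + 500]]
            errors = client.insert_rows_json(self.bq_table, chunk)
            if errors:
                raise RuntimeError(errors)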
By the way, I strongly recommend following the first solution.
Additional tip: BigQuery can now directly query a Cloud SQL database. If you still need your MySQL database (to keep some referential data in it), you can migrate it to Cloud SQL and perform a join between your BigQuery data warehouse and your Cloud SQL referential data.
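A minimal sketch of such a join using a BigQuery federated query; the connection ID and table names are placeholders, and you first have to create a Cloud SQL connection resource in BigQuery:

from google.cloud import bigquery

client = bigquery.Client()

sql = """
SELECT w.*, r.label
FROM my_dataset.warehouse_table AS w
JOIN EXTERNAL_QUERY(
  'my_project.us.my_cloudsql_connection',
  'SELECT id, label FROM referential;') AS r
ON w.id = r.id
"""
results = client.query(sql).result()  # rows combining BigQuery and Cloud SQL data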
It is indeed possible to synchronize MySQL databases to BigQuery with Airflow.
You would of course need to make sure you have properly authenticated connections in your Airflow DAG workflow.
Also, make sure to define which columns from MySQL you would like to pull and load into BigQuery. You also want to choose the method of loading your data: would you want it loaded incrementally or fully? Be sure to also formulate a technique for eliminating duplicate copies of data (de-duplication).
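For the de-duplication step, one common approach (sketched here under the assumption of an id key and an updated_at column, both placeholders) is to keep only the latest row per key after each incremental load:

from google.cloud import bigquery

client = bigquery.Client()

dedup_sql = """
CREATE OR REPLACE TABLE my_dataset.my_table AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY updated_at DESC) AS rn
  FROM my_dataset.my_table
)
WHERE rn = 1
"""
client.query(dedup_sql).result()  # rewrite the table keeping one row per id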
You can find more information on this topic through this link:
How to Sync Mysql into Bigquery in realtime?
Here is a great resource for setting up your bigquery account and authentications:
https://www.youtube.com/watch?v=fAwWSxJpFQ8
You can also have a look at stitchdata.com (https://www.stitchdata.com/integrations/mysql/google-bigquery/)
The Stitch MySQL integration will ETL your MySQL to Google BigQuery in minutes and keep it up to date without having to constantly write and maintain ETL scripts. Google Cloud Storage or Cloud SQL won’t be necessary in this case.
For more information on aggregating data for BigQuery using Apache Airflow you may refer to the link below:
https://cloud.google.com/blog/products/gcp/how-to-aggregate-data-for-bigquery-using-apache-airflow
Can data be inserted into Redshift from a local computer without copying the data to S3 first?
Basically, as a direct insert, record by record, into Redshift?
If yes - what library / connection string can be used?
(I am not concerned about performance)
Thanks.
Can data be inserted into a RedShift from a local computer without copying data to S3 first? Basically as a direct insert of record by record into RedShift?
Yes, it can be done, though it is not the preferred method; but as you have already weighed, performance is not a concern here.
You can use the psycopg2 library. You can run this from any machine (local, EC2, or any other cloud platform) that has a network connection allowed to/from your Redshift instance.
Here is a Python code snippet.
import psycopg2

def redshift():
    # Connect to the Redshift cluster (endpoint masked here).
    conn = psycopg2.connect(dbname='your_database', host='a********8.****s.redshift.amazonaws.com', port='5439', user='user', password='Pass')
    cur = conn.cursor()
    cur.execute("insert into test values ('1', '2', '3', '4')")
    conn.commit()  # commit so the inserted row is persisted
    print('success')
    cur.close()
    conn.close()

redshift()
It depends on whether you are talking about Redshift or Redshift Spectrum.
With Redshift Spectrum you have to put the data on S3, but if you use Redshift you can do an insert with SQLAlchemy, for example.
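For example, a minimal sketch with SQLAlchemy; the cluster endpoint, credentials and table are placeholders, and since Redshift speaks the PostgreSQL wire protocol, the plain postgresql+psycopg2 dialect is enough for simple inserts:

import sqlalchemy as sa

engine = sa.create_engine(
    "postgresql+psycopg2://user:password@my-cluster.abc123.us-east-1.redshift.amazonaws.com:5439/your_database")

with engine.begin() as conn:  # begin() opens a transaction and commits on success
    conn.execute(sa.text("INSERT INTO test VALUES ('1', '2', '3', '4')"))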
The easiest way to query AWS Redshift from Python is through this Jupyter extension: Jupyter Redshift.
Not only can you query and save your results but also write them back to the database from within the notebook environment.