Exporting data from Athena to Python - python

I am trying to export data from Athena (AWS) to Python. Or, is there a way to connect Python to Athena, just as there is a way to connect Python to MySQL?
I have around 15 GB of data in Athena and would like to export it to perform further analysis. There needs to be some way to export such a large dataset.
Detailed steps would be appreciated!
Thanks in advance!

Data is not actually stored in Amazon Athena. Rather, Amazon Athena looks at data that is stored in Amazon S3 and runs queries across it.
Therefore, if you just want the raw data, simply copy the files directly from S3. Simple!
However, if you wish to run a query in Amazon Athena and export/manipulate the results, you can use the Athena client in Boto 3 (the AWS SDK for Python) to call Athena from Python.
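A minimal sketch of that Boto 3 approach: start a query, poll until it completes, then page through the results. The bucket, database, and table names below are placeholders, not anything from your account.

```python
import time


def run_athena_query(sql, database, output_location):
    """Run an Athena query via boto3 and return the result rows.

    `database` and `output_location` (an s3:// URI for query results)
    are placeholders you must supply for your own account.
    """
    import boto3  # imported here so the module loads without AWS installed

    client = boto3.client("athena")
    qid = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the query reaches a terminal state
    while True:
        state = client.get_query_execution(QueryExecutionId=qid)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(2)
    if state != "SUCCEEDED":
        raise RuntimeError(f"Query {qid} ended in state {state}")

    # Page through the result set (results are capped per page)
    rows = []
    paginator = client.get_paginator("get_query_results")
    for page in paginator.paginate(QueryExecutionId=qid):
        rows.extend(page["ResultSet"]["Rows"])
    return rows


if __name__ == "__main__":
    print(run_athena_query(
        "SELECT * FROM my_table LIMIT 10",   # hypothetical table
        database="my_database",              # hypothetical database
        output_location="s3://my-athena-results/",
    ))
```

For 15 GB, note that Athena also writes the full query result as a CSV to the `OutputLocation` bucket, so for large exports it is usually faster to run the query and then download that CSV file from S3 directly rather than paging through the API.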

Related

Is it possible to upload a CSV to redshift and have it automatically run and export the saved queries?

I manually uploaded a CSV to S3, then copied it into Redshift and ran the queries. I want to build a website where users can enter data, have the queries run automatically when the data is entered, and see the results of the queries.
Amazon Redshift does not have Triggers. Therefore, it is not possible to 'trigger' an action when data is loaded into Redshift.
Instead, whatever process you use to load the data will also need to run the queries.
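That load-then-query process could be sketched like this with psycopg2. The table name, S3 path, and IAM role below are placeholders; it assumes the Redshift cluster's IAM role can read from the bucket.

```python
def load_and_query(conn, s3_path, iam_role):
    """COPY a CSV from S3 into Redshift, then run the saved queries
    in the same process (since Redshift has no triggers).

    `conn` is an open psycopg2 connection; `s3_path` and `iam_role`
    are placeholders for your own bucket and role ARN.
    """
    with conn.cursor() as cur:
        # Step 1: load the uploaded CSV into the staging table
        cur.execute(
            f"COPY my_table FROM '{s3_path}' "
            f"IAM_ROLE '{iam_role}' CSV IGNOREHEADER 1;"
        )
        # Step 2: immediately run the saved query over the new data
        cur.execute(
            "SELECT category, COUNT(*) FROM my_table GROUP BY category;"
        )
        results = cur.fetchall()
    conn.commit()
    return results


if __name__ == "__main__":
    import psycopg2
    conn = psycopg2.connect(
        host="my-cluster.example.redshift.amazonaws.com",  # placeholder
        port=5439, dbname="dev", user="admin", password="...",
    )
    print(load_and_query(conn, "s3://my-bucket/data.csv",
                         "arn:aws:iam::123456789012:role/RedshiftCopyRole"))
```

A website backend would call something like `load_and_query` right after the upload finishes, which stands in for the missing trigger.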

ETL to bigquery using airflow without have permission cloud storage/ cloud sql

I have done ETL from MySQL to BigQuery with Python, but because I don't have permission to connect to Google Cloud Storage / Cloud SQL, I have to dump the data and partition it by last date. This way is easy, but it isn't worth it because it takes too much time. Is it possible to do ETL with Airflow from MySQL/Mongo to BigQuery without Google Cloud Storage / Cloud SQL?
With Airflow or not, the easiest and most efficient way is to:
Extract the data from the data source
Load the data into a file
Drop the file into Cloud Storage
Run a BigQuery load job on these files (load jobs are free)
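The last step could be sketched like this with the google-cloud-bigquery client. The bucket, project, dataset, and table names are placeholders.

```python
def load_from_gcs(uri, table_id):
    """Run a (free) BigQuery load job over CSV files already in
    Cloud Storage. `uri` may contain a wildcard; `table_id` is a
    placeholder in "project.dataset.table" form.
    """
    from google.cloud import bigquery  # deferred so the module loads anywhere

    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,   # skip the CSV header row
        autodetect=True,       # infer the schema from the data
    )
    job = client.load_table_from_uri(uri, table_id, job_config=job_config)
    job.result()  # block until the load job completes
    return client.get_table(table_id).num_rows


if __name__ == "__main__":
    print(load_from_gcs("gs://my-bucket/dump/*.csv",
                        "my_project.my_dataset.my_table"))
```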
If you want to avoid creating a file and dropping it into Cloud Storage, another, much more complex way is possible: stream the data into BigQuery.
Run a query (MySQL or Mongo)
Fetch the result.
For each row, stream-write the result into BigQuery (streaming is not free on BigQuery)
Described like this, it does not seem very complex, but:
You have to maintain the connection to the source and to the destination during the whole process
You have to handle errors (read and write) and be able to restart from the last point of failure
You have to perform bulk stream writes into BigQuery to optimize performance. The chunk size has to be chosen wisely.
Airflow bonus: you have to define and write your own custom operator to do this.
That said, I strongly recommend following the first solution.
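The streaming alternative could be sketched as below: batch the source rows and stream-write each batch. The table name is a placeholder, and remember that streaming inserts are billed, unlike load jobs.

```python
def chunked(rows, size):
    """Yield fixed-size batches from an iterable of rows.
    The batch size has to be tuned for throughput."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def stream_rows(rows, table_id, batch_size=500):
    """Stream dict rows into BigQuery in bulk batches.
    `table_id` ("project.dataset.table") is a placeholder."""
    from google.cloud import bigquery

    client = bigquery.Client()
    for batch in chunked(rows, batch_size):
        errors = client.insert_rows_json(table_id, batch)
        if errors:
            # In a real pipeline you would retry from this batch,
            # i.e. restart at the last point of failure
            raise RuntimeError(errors)
```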
Additional tip: BigQuery can now query a Cloud SQL database directly. If you still need your MySQL database (to keep some referential data in it), you can migrate it into Cloud SQL and perform a join between your BigQuery data warehouse and your Cloud SQL referential data.
It is indeed possible to synchronize MySQL databases to BigQuery with Airflow.
You would of course need to make sure you have properly authenticated connections in your Airflow DAG workflow.
Also, make sure to define which columns from MySQL you would like to pull and load into BigQuery, and choose how to load your data: incrementally or fully. Be sure to also formulate a technique for eliminating duplicate copies of the data (de-duplication).
You can find more information on this topic through this link:
How to Sync Mysql into Bigquery in realtime?
Here is a great resource for setting up your bigquery account and authentications:
https://www.youtube.com/watch?v=fAwWSxJpFQ8
You can also have a look at Stitch (https://www.stitchdata.com/integrations/mysql/google-bigquery/).
The Stitch MySQL integration will ETL your MySQL to Google BigQuery in minutes and keep it up to date without having to constantly write and maintain ETL scripts. Google Cloud Storage or Cloud SQL won’t be necessary in this case.
For more information on aggregating data for BigQuery using Apache Airflow you may refer to the link below:
https://cloud.google.com/blog/products/gcp/how-to-aggregate-data-for-bigquery-using-apache-airflow

How to copy a table with millions of rows from PostgreSQL to Amazon Redshift using pandas or python

What is the best possible way to copy a table (with millions of rows) from one type of database to another using pandas or Python?
I have a table in a PostgreSQL database consisting of millions of rows, and I want to move it to Amazon Redshift. What is the best possible way to achieve that using pandas or Python?
The AWS Database Migration Service (DMS) can handle it:
Using a PostgreSQL Database as a Source for AWS DMS - AWS Database Migration Service
Using an Amazon Redshift Database as a Target for AWS Database Migration Service - AWS Database Migration Service
Alternatively, if you wish to do it yourself:
Export the data from PostgreSQL into CSV files (they can be gzip compressed)
Upload the files to Amazon S3
Create the destination tables in Amazon Redshift
Use the COPY command in Amazon Redshift to load the CSV files into Redshift
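A sketch of the do-it-yourself steps in Python, assuming psycopg2 for both databases and boto3 for the upload. Table, bucket, and role names are placeholders.

```python
import gzip


def export_postgres_to_csv(pg_conn, table, path):
    """Step 1: dump the PostgreSQL table to a gzip-compressed CSV."""
    with pg_conn.cursor() as cur, gzip.open(path, "wt") as f:
        cur.copy_expert(f"COPY {table} TO STDOUT WITH CSV HEADER", f)


def upload_to_s3(path, bucket, key):
    """Step 2: upload the file to Amazon S3."""
    import boto3  # deferred so the module loads without AWS installed

    boto3.client("s3").upload_file(path, bucket, key)


def copy_into_redshift(rs_conn, table, s3_uri, iam_role):
    """Step 4: load the CSV into the (pre-created) Redshift table.
    `iam_role` is the ARN of a role that can read from the bucket."""
    with rs_conn.cursor() as cur:
        cur.execute(
            f"COPY {table} FROM '{s3_uri}' IAM_ROLE '{iam_role}' "
            "CSV IGNOREHEADER 1 GZIP;"
        )
    rs_conn.commit()
```

Step 3 (creating the destination table) is plain DDL, which you would run once against Redshift before the COPY.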
If you're using AWS services, it might be good to use AWS Glue: it uses Python scripts for its ETL operations and is very well suited to moves such as DynamoDB --> Redshift, for example.
If you're not using only AWS services, try exporting your data as CSV (I did this for millions of rows) and create a migration tool using C# or whatever to read the CSV file and insert your rows, converting them as needed. (Check whether the destination database can ingest the CSV directly, so you can avoid doing the migration yourself.)

Backing up a postgresql database table with python in lambda

I am attempting to write a python script which will run in AWS Lambda, back up a PostgreSQL database table which is hosted in Amazon RDS, then dump a resulting .bak file or similar to S3.
I'm able to connect to the database and make changes to it, but I'm not quite sure how to go about the next steps. How do I actually back up the DB and write it to a backup file in the S3 bucket?
Depending on how large your database is, Lambda may not be the best solution. Lambdas have limits of 512 MB of /tmp disk space, a 15-minute timeout, and 3,008 MB of memory. Maxing out these limits may also be more expensive than other options.
Using EC2 or Fargate along with boto3 or the AWS CLI may be a better solution. Here is a blog entry that walks through such a solution:
https://francescoboffa.com/using-s3-to-store-your-mysql-or-postgresql-backups
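The approach from that kind of setup could be sketched as below: shell out to `pg_dump`, then push the dump file to S3 with boto3. Host and bucket names are placeholders, and `pg_dump` must be on the PATH (easy on EC2/Fargate, awkward inside Lambda, where you would have to bundle the binary).

```python
import subprocess


def backup_to_s3(host, dbname, user, bucket, key, dump_path="/tmp/db.dump"):
    """Dump a PostgreSQL database with pg_dump and upload it to S3.

    All connection and bucket values are placeholders; the password is
    expected via PGPASSWORD in the environment or a ~/.pgpass file.
    """
    subprocess.run(
        ["pg_dump", "-h", host, "-U", user, "-Fc", "-f", dump_path, dbname],
        check=True,  # raise CalledProcessError if pg_dump fails
    )
    import boto3  # deferred so the module loads without AWS installed

    boto3.client("s3").upload_file(dump_path, bucket, key)
```

`-Fc` produces PostgreSQL's compressed custom-format archive, which `pg_restore` can restore selectively (including a single table).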
The method that worked for me was to create an AWS data pipeline to back up the database to CSV.

AWS Glue - read from a sql server table and write to S3 as a custom CSV file

I have been working with Glue since January, and have built multiple POCs and production data lakes using AWS Glue / Databricks / EMR, etc. I have used AWS Glue to read data from S3 and perform ETL before loading to Redshift, Aurora, etc.
I now need to read data from a source table which is on SQL Server, and write it to an S3 bucket as a custom (user-defined) CSV file, say employee.csv.
I am looking for some pointers on how to do this, please.
Thanks
You can connect using JDBC by specifying connection_type="sqlserver" to get a dynamic frame connected to SQL Server. See here for the GlueContext docs:
dynF = glueContext.getSource(connection_type="sqlserver", url=..., dbtable=..., user=..., password=...)
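A fuller sketch of the same idea, wrapped as a function for a Glue job script. The JDBC URL, table, credentials, and S3 path are placeholders.

```python
def export_sqlserver_to_csv(jdbc_url, table, user, password, s3_path):
    """Read a SQL Server table into a Glue DynamicFrame over JDBC and
    write it to S3 as a single CSV with a header row.

    All parameters are placeholders, e.g.
    jdbc_url = "jdbc:sqlserver://myhost:1433;databaseName=mydb".
    """
    from awsglue.context import GlueContext
    from pyspark.context import SparkContext

    glueContext = GlueContext(SparkContext.getOrCreate())
    dynF = glueContext.create_dynamic_frame.from_options(
        connection_type="sqlserver",
        connection_options={
            "url": jdbc_url,
            "dbtable": table,
            "user": user,
            "password": password,
        },
    )
    # repartition(1) yields a single part-00000 file, which you can then
    # rename/copy to employee.csv with boto3 if the exact name matters
    dynF.toDF().repartition(1).write.mode("overwrite") \
        .option("header", "true").csv(s3_path)
```

Spark always writes a directory of part files, hence the rename step if you need a literal `employee.csv` key.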
This task fits AWS DMS (Data Migration Service) use case. DMS is designed to either migrate data from one data storage to another or keep them in sync. It can certainly keep in sync as well as transform your source (i.e., MSSQL) to your target (i.e., S3).
There is one non-negligible constraint in your case, though: ongoing replication from a MSSQL source only works if your license is the Enterprise or Developer Edition, for versions 2016-2019.
