JDBC limitation on lists - python

I am trying to write a data migration script moving data from one database to another (Teradata to Snowflake) using JDBC cursors.
The table I am working on has about 170 million records, and when I execute the batch insert I run into the error: maximum number of expressions in a list exceeded, expected at most 16,384, got 170,000,000.
I was wondering if there is any way around this, or if there is a better way to batch-migrate records without exporting them to a file and moving it to S3 to be consumed by Snowflake.

If your table has 170M records, then using JDBC INSERT statements into Snowflake is not feasible. It would perform millions of separate insert commands, each requiring a round-trip to the cloud service, which would take hundreds of hours.
Your most efficient strategy would be to export from Teradata into multiple delimited files, say with 1 to 10 million rows each. You can then either use Amazon's client API to move the files to S3 in parallel, or use Snowflake's own PUT command to upload the files to the staging area for your target table. Either way, once the files are in your S3 bucket or in Snowflake's staging area, you can load them very rapidly with Snowflake's COPY command.
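For reference, a minimal sketch of the PUT/COPY route using the snowflake-connector-python package; the connection parameters, file path, and table name are placeholders:

import snowflake.connector

# Connection parameters are placeholders -- substitute your own account details.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
    warehouse="my_wh",
    database="my_db",
    schema="public",
)
cur = conn.cursor()

# Upload the exported delimited files to the target table's internal stage.
cur.execute("PUT file:///exports/orders_*.csv @%ORDERS PARALLEL=8")

# One COPY command bulk-loads everything staged for the table.
cur.execute("""
    COPY INTO ORDERS
    FROM @%ORDERS
    FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|' SKIP_HEADER = 1)
""")
cur.close()
conn.close()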

Related

ELT Pipeline - AWS RDS to BigQuery

I joined a startup as a junior data engineer and I'm working on setting up a data warehouse for BI/visualization. I wanted to get an idea of approaches for the extraction/loading part, as the company is also new to data engineering.
The company is thinking of going with Google BigQuery for warehousing. The main data source is currently a single OLTP PostgreSQL database hosted on AWS RDS. The database is about 50 GB for now, with nearly a hundred tables.
I was initially thinking of using Stitch to integrate directly with BigQuery, but since the team is moving the RDS instance to a private subnet, it would not be accessible to third-party tools that require a publicly accessible URL (would it?).
How would I go about it? I am still pretty new to data engineering, so I wanted some advice. I was thinking of using:
RDS -> Lambda/VM with Python extraction/load script -> BigQuery upload using API
But how would I account for changing row values, e.g. a customer's status changing in a table? Would BigQuery automatically handle such changes?
I would also want to set up regular daily data transfers. A cron job could run the Python script, but would this be a costly approach considering that there are a number of large tables (extraction, conversion to a dataframe/CSV, then uploading to BQ)?
As the data size increases, I would need to upsert data instead of overwriting tables. Can BigQuery, or other warehouse solutions like Redshift, handle this? My main factors for a solution are cost, time to set up, and data loading duration.
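For reference, BigQuery won't pick up source-side row changes automatically; one common pattern is to load changed rows into a staging table and upsert them with a MERGE statement. A rough sketch with the google-cloud-bigquery client, using hypothetical dataset, table, and key names:

from google.cloud import bigquery

# Hypothetical project/dataset/table/key names, purely for illustration.
client = bigquery.Client(project="my-project")

merge_sql = """
MERGE `my_dataset.customers` T
USING `my_dataset.customers_staging` S
ON T.customer_id = S.customer_id
WHEN MATCHED THEN
  UPDATE SET status = S.status, updated_at = S.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, status, updated_at)
  VALUES (S.customer_id, S.status, S.updated_at)
"""

# Runs the upsert; a daily job would reload the staging table first.
client.query(merge_sql).result()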

How to copy a table with millions of rows from PostgreSQL to Amazon Redshift using pandas or python

What is the best possible way to copy a table (with millions of rows) from one type of database to another using pandas or Python?
I have a table in a PostgreSQL database consisting of millions of rows and I want to move it to Amazon Redshift. What is the best possible way to achieve that using pandas or Python?
The AWS Database Migration Service (DMS) can handle this:
Using a PostgreSQL Database as a Source for AWS DMS - AWS Database Migration Service
Using an Amazon Redshift Database as a Target for AWS Database Migration Service - AWS Database Migration Service
Alternatively, if you wish to do it yourself:
Export the data from PostgreSQL into CSV files (they can be gzip compressed)
Upload the files to Amazon S3
Create the destination tables in Amazon Redshift
Use the COPY command in Amazon Redshift to load the CSV files into Redshift (a short Python sketch of the upload and COPY steps follows below)
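A minimal sketch of the do-it-yourself route in Python, assuming boto3 and psycopg2 are installed; the bucket, credentials, IAM role, and table names are placeholders:

import boto3
import psycopg2

# Placeholder names -- substitute your own bucket, cluster, and role details.
s3 = boto3.client("s3")
s3.upload_file("/tmp/orders.csv.gz", "my-migration-bucket", "exports/orders.csv.gz")

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439, dbname="analytics", user="admin", password="secret",
)
with conn, conn.cursor() as cur:
    # COPY loads the gzip-compressed CSV directly from S3 into the target table.
    cur.execute("""
        COPY orders
        FROM 's3://my-migration-bucket/exports/orders.csv.gz'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        CSV GZIP IGNOREHEADER 1
    """)
conn.close()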
If you're using AWS services, it might be good to use AWS Glue. It uses Python scripts for its ETL operations and is well suited to, for example, DynamoDB --> Redshift.
If you're not using only AWS services, try exporting your data as CSV (I did this for millions of rows) and build a migration tool in C# or whatever language to read the CSV files and insert the rows, converting them as needed. Check whether the target database can load the CSV directly, so you can avoid writing the migration yourself.

Speed up PostgreSQL to BigQuery

I would like to upload some data that is currently stored in PostgreSQL to Google BigQuery to see how the two tools compare.
There are many options for moving data around, but the most user-friendly one (for me) that I've found so far leverages the power of Python pandas.
sql = "SELECT * FROM {}".format(input_table_name)
i = 0
for chunk in pd.read_sql_query(sql , engine, chunksize=10000):
print("Chunk number: ",i)
i += 1
df.to_gbq(destination_table="my_new_dataset.test_pandas",
project_id = "aqueduct30",
if_exists= "append" )
However, this approach is rather slow and I was wondering what options I have to speed things up. My table has 11 million rows and 100 columns.
The PostgreSQL instance is on AWS RDS and I call Python from an Amazon EC2 instance. Both are large and fast. I am currently not using multiple processors, although 16 are available.
As alluded to in the comment from JosMac, your solution/approach simply won't scale with large datasets. Since you're already running on AWS/RDS, something like the following would be better in my opinion:
Export Postgres table(s) to S3
Use the GCS transfer service to pull export from S3 into GCS
Load directly into BigQuery from GCS (consider automating this pipeline using Cloud Functions and Dataflow); a short sketch of the load step follows below
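A minimal sketch of the final load step with the google-cloud-bigquery client, assuming the exported CSV files already sit in a GCS bucket; the bucket, dataset, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client(project="aqueduct30")

# Placeholder URI and table -- point these at your own bucket and dataset.
gcs_uri = "gs://my-export-bucket/postgres_dump/*.csv"
table_id = "aqueduct30.my_new_dataset.test_pandas"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # let BigQuery infer the schema from the CSV headers
)

# A single load job ingests all matching files server-side, far faster than
# streaming chunks through pandas.
load_job = client.load_table_from_uri(gcs_uri, table_id, job_config=job_config)
load_job.result()  # waits for completion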

Speedup my data load operation

Please pardon my ignorance if this question sounds silly to the expert audience here.
Currently, as per my use case, I am performing certain analysis on data present in AWS Redshift tables and saving the results as CSV files in S3 buckets (the operation is somewhat similar to a pivot for the Redshift database), and after that I am loading the data back into Redshift using the COPY command.
Currently, after the analysis (which is done in Python 3), about 200 CSV files are generated, which are saved into 200 different tables in Redshift.
The count of CSVs will keep increasing with time.
Currently the whole process takes about 50-60 minutes to complete:
25 minutes to generate the ~200 CSVs and upload them to S3 buckets
25 minutes to load the ~200 CSVs into 200 AWS Redshift tables
The size of each CSV varies from a few MB to 1 GB.
I was looking for tools or AWS technologies which can help me reduce this time.
*Additional info:
The structure of the CSVs keeps changing, hence I have to drop and recreate the tables each time.
This is a repetitive task and will be executed every 6 hours.
You can achieve a significant speed-up by:
Using multipart upload of the CSVs to S3: instead of waiting for a single file to finish uploading in one piece, multipart upload sends the parts to S3 in parallel, saving you considerable time. Read about it here and here. Here is the Boto3 reference for it.
Copying data into Redshift from S3 in parallel: if you split each file into multiple parts and then run the COPY command, the data is loaded from multiple files in parallel instead of waiting for a single 1 GB file to load, which can be really slow. Read more about it here. (A short sketch of both steps follows this list.)
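A minimal sketch of both ideas, assuming boto3 and psycopg2 are available; the bucket, prefix, and cluster details are placeholders. boto3's TransferConfig drives the multipart upload, and pointing a single COPY at a key prefix lets Redshift load all of the split files in parallel:

import boto3
from boto3.s3.transfer import TransferConfig
import psycopg2

# Multipart, parallel upload of one large CSV (placeholder bucket/key names).
config = TransferConfig(multipart_threshold=64 * 1024 * 1024, max_concurrency=8)
s3 = boto3.client("s3")
s3.upload_file("/tmp/pivot_table_a_part01.csv", "my-etl-bucket",
               "pivots/table_a/part01.csv", Config=config)

# One COPY per target table, pointed at the key prefix: Redshift loads every
# file under the prefix in parallel across its slices.
conn = psycopg2.connect(host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
                        port=5439, dbname="analytics", user="admin", password="secret")
with conn, conn.cursor() as cur:
    cur.execute("""
        COPY table_a
        FROM 's3://my-etl-bucket/pivots/table_a/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        CSV IGNOREHEADER 1
    """)
conn.close()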
Hope this helps.
You should explore Athena. It's a tool within the AWS ecosystem that gives you the flexibility to query CSV (or even gzip) files.
It will save you the time you take to manually copy the data into the Redshift tables, and you'll be able to query the dataset from the CSVs themselves; Athena can query them directly from an S3 bucket.
However, it is still in a development phase, so you'll have to spend some time with it, as it's not very user friendly. A syntax error in your query logs you out of your AWS session rather than throwing a syntax error. Moreover, you won't find much documentation or many developer talks on the internet, since Athena is still largely unexplored.
Athena charges you based on the data your query fetches and is thus more pocket friendly. If a query fails to execute, Amazon won't charge you.
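For illustration, a rough sketch of querying the CSVs in S3 through Athena from Python with boto3; the database, table, and output location are placeholders, and the table would first need to be defined over the S3 prefix (e.g. with a CREATE EXTERNAL TABLE statement or a Glue crawler):

import time
import boto3

athena = boto3.client("athena")

# Placeholder database/table/output location.
query = "SELECT category, SUM(amount) AS total FROM pivot_results GROUP BY category"
execution = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_analysis_db"},
    ResultConfiguration={"OutputLocation": "s3://my-etl-bucket/athena-results/"},
)

# Poll until the query finishes, then fetch the result rows.
query_id = execution["QueryExecutionId"]
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]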

Amazon EC2 & S3 When using Python / SQLite?

Suppose that I have a huge SQLite file (say, 500[MB]) stored in Amazon S3.
Can a Python script that is run on a small EC2 instance directly access and modify that SQLite file, or must I first copy the file to the EC2 instance, change it there, and then copy it back over to S3?
Will the I/O be efficient?
Here's what I am trying to do. As I wrote, I have a 500[MB] SQLite file in S3. I'd like to start say 10 different Amazon EC2 instances that will each read a subset of the file and do some processing (every instance will handle a different subset of the 500[MB] SQLite file). Then, once processing is done, every instance will update only the subset of the data it dealt with (as explained, there will be no overlap of data among processes).
For example, suppose that the SQLite file has say 1M rows:
instance 1 will deal with (and update) rows 0 - 100000
instance 2 will deal with (and update) rows 100001 - 200000
...
instance 10 will deal with (and update) rows 900001 - 1000000
Is it at all possible? Does it sound OK? Any suggestions/ideas are welcome.
I'd like to start say 10 different Amazon EC2 instances that will each read a subset of the file and do some processing (every instance will handle a different subset of the 500[MB] SQLite file)
You cannot do this with SQLite, on Amazon infrastructure or otherwise. SQLite performs database-level write locking; unless all ten nodes are performing reads exclusively, you will not attain any kind of concurrency. Even the SQLite website says so:
Situations Where Another RDBMS May Work Better
Client/Server Applications
High-volume Websites
Very large datasets
High Concurrency
Have you considered PostgreSQL?
Since S3 cannot be directly mounted, your best bet is to create an EBS volume containing the SQLite file and work directly with the EBS volume from another (controller) instance. You can then create snapshots of the volume and archive them into S3. Using a tool like boto (the Python API), you can automate the creation of snapshots and the process of moving the backups into S3.
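A rough sketch of automating that snapshot step with boto3; the volume ID and description are placeholders:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Placeholder volume ID -- the EBS volume holding the SQLite file.
snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",
    Description="Nightly snapshot of the SQLite volume",
)
print("Started snapshot:", snapshot["SnapshotId"])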
You can mount an S3 bucket on your Linux machine. See below:
s3fs - http://code.google.com/p/s3fs/wiki/InstallationNotes - this did work for me. It uses a FUSE file system + rsync to sync the files in S3. It keeps a copy of all filenames in the local system and makes them look like files/folders.
This is good if the system is already in place and running with a huge collection of data. But if you are building this from scratch, then I would suggest you have an EBS volume for SQLite and use this script to create a snapshot of your EBS volume:
https://github.com/rakesh-sankar/Tools/blob/master/AmazonAWS/EBS/ebs-snapshot.sh
If your DB structure is simple, why not just use AWS SimpleDB? Or run MySQL (or another DB) on one of your instances.
Amazon EFS can be shared among EC2 instances. It's a managed NFS share. SQLite will still lock the whole DB on writes.
The SQLite website does not recommend NFS shares, though. But depending on the application, you can share the DB read-only among several EC2 instances, store the results of your processing somewhere else, and then concatenate the results in the next step.
