Speed up my data load operation - python

Please pardon my ignorance if this question sounds silly to the expert audience here.
Currently, as per my use case,
I am performing certain analysis on the data present in AWS Redshift tables and saving the results as CSV files in S3 buckets
(the operation is somewhat similar to a pivot for a Redshift database),
and after that I am loading the data back into Redshift using the COPY command.
Currently, after performing the analysis (which is done in Python 3), about 200 CSV files are generated, which are saved into 200 different tables in Redshift.
The count of CSVs will keep increasing with time.
Currently the whole process takes about 50-60 minutes to complete:
25 minutes to generate the approx. 200 CSVs and upload them to S3 buckets
25 minutes to load the approx. 200 CSVs into 200 Redshift tables
The size of the CSVs varies from a few MB to 1 GB.
I was looking for tools or AWS technologies which can help me reduce this time.
*Additional info
The structure of the CSVs keeps changing, hence I have to drop and recreate the tables each time.
This is a repetitive task and would be executed every 6 hours.

You can achieve a significant speed-up by:
Using multi-part upload of the CSVs to S3: instead of waiting for a single file to upload sequentially, multi-part upload sends the parts of the file to S3 in parallel, saving you considerable time. Read about it here and here. Here is the Boto3 reference for it.
Copying data into Redshift from S3 in parallel: if you split your file into multiple parts and then run the COPY command, the data will be loaded from the multiple files in parallel, instead of waiting for a single 1 GB file to load, which can be really slow. Read more about it here. A sketch of both ideas follows.
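A minimal sketch of both ideas, assuming boto3 and psycopg2 are installed; the bucket, key prefix, table name, credentials, and IAM role ARN are all placeholders:

import boto3
from boto3.s3.transfer import TransferConfig
import psycopg2

# 1) Managed transfer: boto3 switches to multi-part upload above the
#    threshold and uploads the parts concurrently.
config = TransferConfig(
    multipart_threshold=8 * 1024 * 1024,  # use multi-part above 8 MB
    multipart_chunksize=8 * 1024 * 1024,
    max_concurrency=10,                   # parallel part uploads
)
s3 = boto3.client('s3')
s3.upload_file('pivot_result.csv', 'my-bucket', 'pivot/part_000.csv', Config=config)

# 2) COPY from a key prefix: Redshift loads every file matching the
#    prefix in parallel across its slices.
conn = psycopg2.connect(host='my-cluster.redshift.amazonaws.com',
                        port=5439, dbname='mydb', user='user', password='pw')
with conn.cursor() as cur:
    cur.execute("""
        COPY my_table
        FROM 's3://my-bucket/pivot/part_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        CSV;
    """)
conn.commit()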
Hope this helps.

You should explore Athena. It's a service within the AWS suite that gives you the flexibility to query CSV (or even gzipped) files.
It'll save you the time you currently spend copying the data into Redshift tables, and you'll be able to query the dataset from the CSVs themselves; Athena can query them directly from an S3 bucket.
However, as it is still maturing, you'll have to spend some time with it, because it's not very user friendly. A syntax error in your query logs you out of your AWS session rather than reporting the error. Moreover, you won't find much documentation or many developer talks on the internet, since Athena is still largely unexplored.
Athena charges you according to the data your query scans, and is thus more pocket friendly. If a query fails to execute, Amazon doesn't charge you.
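A hedged sketch of querying CSVs in S3 through Athena's boto3 API; the database, table, and result location are placeholders, and the table is assumed to have been defined over the bucket already:

import boto3

athena = boto3.client('athena')
resp = athena.start_query_execution(
    QueryString='SELECT col_a, col_b FROM my_csv_table LIMIT 10',
    QueryExecutionContext={'Database': 'my_database'},
    ResultConfiguration={'OutputLocation': 's3://my-bucket/athena-results/'},
)
print(resp['QueryExecutionId'])  # poll get_query_execution() until the query finishes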

Related

ELT Pipeline - AWS RDS to BigQuery

I joined as a junior data engineer at a startup and I'm working on setting up a data warehouse for BI/visualization. I wanted to get an idea of approaches for the extraction/loading part as the company is also new to data engineering.
The company is thinking of going with Google BigQuery for warehousing. The main data source is currently a single OLTP PostgreSQL database hosted on AWS RDS. The database is about 50 GB for now with nearly a hundred tables.
I was initially thinking of using Stitch to integrate directly with BigQuery, but since the team is shifting the RDS instance to a private subnet, it would not be possible to access it using third-party tools, which require a publicly accessible URL (would it?).
How would I go about it? I am still pretty new to data engineering so wanted some advice. I was thinking about using:
RDS -> Lambda/VM with Python extraction/load script -> BigQuery upload using API
But how would I account for changing row values, e.g. when a customer's status changes in a table? Would BigQuery automatically handle such changes?
Plus, I would want to set up regular daily data transfers. For this, I think a cron job can be set up with the Python script, but would this be a costly approach considering that there are a bunch of large tables (extraction, conversion to dataframe/CSV, then uploading to BQ)?
As the data size increases, I would need to upsert data instead of overwriting tables. Can BigQuery or other warehouse solutions like Redshift handle this? My main factors for choosing a solution are cost, time to set up, and data loading durations.
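On the upsert question: BigQuery will not track row changes for you automatically; a common pattern is to load each extract into a staging table and MERGE it into the target. A minimal sketch using the google-cloud-bigquery client, where the project, dataset, table, and column names are all hypothetical:

from google.cloud import bigquery

client = bigquery.Client()
merge_sql = """
MERGE `my_project.warehouse.customers` AS target
USING `my_project.staging.customers` AS source
ON target.customer_id = source.customer_id
WHEN MATCHED THEN
  UPDATE SET target.status = source.status, target.updated_at = source.updated_at
WHEN NOT MATCHED THEN
  INSERT (customer_id, status, updated_at)
  VALUES (source.customer_id, source.status, source.updated_at)
"""
client.query(merge_sql).result()  # the upsert happens in one atomic statement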

How to set up GCP infrastructure to perform search quickly over massive set of json data?

I have about 100 million json files (10 TB), each with a particular field containing a bunch of text, for which I would like to perform a simple substring search and return the filenames of all the relevant json files. They're all currently stored on Google Cloud Storage. Normally for a smaller number of files I might just spin up a VM with many CPUs and run multiprocessing via Python, but alas this is a bit too much.
I want to avoid spending too much time setting up infrastructure like a Hadoop server, or loading all of that into some MongoDB database. My question is: what would be a quick and dirty way to perform this task? My original thoughts were to set up something on Kubernetes with some parallel processing running Python scripts, but I'm open to suggestions and don't really have a clue how to go about this.
Easier would be to just load the GCS data into BigQuery and run your query from there (see the sketch below).
Send your data to AWS S3 and use Amazon Athena.
The Kubernetes option would be to set up a cluster in GKE, install Presto on it with a lot of workers, use a Hive metastore with GCS, and query from there. (Presto doesn't have a direct GCS connector yet, AFAIK.) This option is more elaborate.
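To make the BigQuery option concrete, here is a hedged sketch assuming the JSON has been exposed as an external table over GCS (which requires newline-delimited JSON); the project, dataset, table, and field names are placeholders. The _FILE_NAME pseudo-column returns the source file of each row in an external table:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT DISTINCT _FILE_NAME AS filename
FROM `my_project.my_dataset.json_external`
WHERE STRPOS(text_field, 'needle') > 0
"""
for row in client.query(sql).result():
    print(row.filename)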
Hope it helps!

Boto3 put_object() is very slow

TL;DR: Trying to put .json files into S3 bucket using Boto3, process is very slow. Looking for ways to speed it up.
This is my first question on SO, so I apologize if I leave out any important details. Essentially I am trying to pull data from Elasticsearch and store it in an S3 bucket using Boto3. I referred to this post to pull multiple pages of ES data using the scroll function of the ES Python client. As I am scrolling, I am processing the data and putting it in the bucket as a [timestamp].json format, using this:
import boto3

s3 = boto3.resource('s3')
data = '{"some":"json","test":"data"}'
key = "path/to/my/file/[timestamp].json"
# each put_object call is a separate, blocking HTTP request
s3.Bucket('my_bucket').put_object(Key=key, Body=data)
While running this on my machine, I noticed that this process is very slow. Using line profiler, I discovered that this line is consuming over 96% of the time in my entire program:
s3.Bucket('my_bucket').put_object(Key=key, Body=data)
What modification(s) can I make in order to speed up this process? Keep in mind, I am creating the .json files in my program (each one is ~240 bytes) and streaming them directly to S3 rather than saving them locally and uploading the files. Thanks in advance.
Since you are potentially uploading many small files, you should consider a few items:
Some form of threading/multiprocessing. For example, see How to upload small files to Amazon S3 efficiently in Python, and the sketch after this list.
Creating some form of archive file (ZIP) containing sets of your small data blocks and uploading them as larger files. This is of course dependent on your access patterns. If you go this route, be sure to use the boto3 upload_file or upload_fileobj methods instead, as they handle multi-part upload and threading for you.
The S3 performance implications described in Request Rate and Performance Considerations.
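A minimal threaded sketch, assuming many small in-memory JSON payloads; the bucket name and key pattern are placeholders. The low-level boto3 client is thread-safe, while resources are not:

import boto3
from concurrent.futures import ThreadPoolExecutor

s3 = boto3.client('s3')

def put(item):
    key, body = item
    s3.put_object(Bucket='my_bucket', Key=key, Body=body)

payloads = [(f'path/to/my/file/{i}.json', '{"some":"json"}') for i in range(1000)]
with ThreadPoolExecutor(max_workers=32) as pool:
    list(pool.map(put, payloads))  # overlap the per-request network latency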

JDBC limitation on lists

I am trying to write a data migration script moving data from one database to another (Teradata to Snowflake) using JDBC cursors.
The table I am working on has about 170 million records, and when I execute the batch insert I run into this error: maximum number of expressions in a list exceeded, expected at most 16,384, got 170,000,000.
I was wondering if there is any way around this, or if there is a better way to batch-migrate records without exporting them to a file and moving it to S3 to be consumed by Snowflake.
If your table has 170M records, then using JDBC INSERT into Snowflake is not feasible. It would perform millions of separate INSERT commands against the database, each requiring a round-trip to the cloud service, which would take hundreds of hours.
Your most efficient strategy would be to export from Teradata into multiple delimited files -- say, with 1-10 million rows each. You can then either use Amazon's client API to move the files to S3 in parallel, or use Snowflake's own PUT command to upload the files to the staging area for your target table. Either way, you can then load the files very rapidly using Snowflake's COPY command once they are in your S3 bucket or in Snowflake's staging area. A sketch follows.
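A hedged sketch of the PUT-then-COPY route using the Snowflake Python connector; the connection parameters, file paths, and table name are placeholders:

import snowflake.connector

conn = snowflake.connector.connect(
    user='USER', password='PASSWORD', account='ACCOUNT',
    warehouse='WH', database='DB', schema='PUBLIC',
)
cur = conn.cursor()
# upload the exported delimited files to the table's internal stage, in parallel
cur.execute("PUT file:///tmp/export/part_*.csv @%MY_TABLE PARALLEL=8")
# bulk-load every staged file in a single command
cur.execute("COPY INTO MY_TABLE FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = '|')")
conn.close()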

Insert large amount of data to BigQuery via bigquery-python library

I have large CSV and Excel files. I read them and dynamically create the needed CREATE TABLE script depending on the fields and types they contain, then insert the data into the created table.
I have read this and understood that I should send them with jobs.insert() instead of tabledata.insertAll() for large amounts of data.
This is how I call it (it works for smaller files, not large ones):
result = client.push_rows(datasetname,table_name,insertObject) # insertObject is a list of dictionaries
When I use the library's push_rows it gives this error on Windows:
[Errno 10054] An existing connection was forcibly closed by the remote host
and this on Ubuntu:
[Errno 32] Broken pipe
So when I went through the BigQuery-Python code, I found that it uses table_data.insertAll().
How can I do this with this library? I know we can upload through Google Cloud Storage, but I need a direct upload method with this library.
When handling large files, don't use streaming, but batch loading: streaming will easily handle up to 100,000 rows per second, which is pretty good for streaming, but not for loading large files.
The sample code linked is doing the right thing (batch instead of streaming), so what we see is a different problem: the sample code is trying to load all the data straight into BigQuery, but the upload through POST fails. gsutil has a more robust upload algorithm than a plain POST.
Solution: instead of loading big chunks of data through POST, stage them in Google Cloud Storage first, then tell BigQuery to read the files from GCS.
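A minimal sketch of that flow using Google's own google-cloud-bigquery client (rather than the BigQuery-Python wrapper from the question); the bucket, file, and table names are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,  # infer the schema from the file
)
load_job = client.load_table_from_uri(
    'gs://my_bucket/large_file.csv',
    'my_project.my_dataset.my_table',
    job_config=job_config,
)
load_job.result()  # wait for the batch load job to finish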
See also BigQuery script failing for large file
