Reading Partitioned Data through Athena in downstream jobs in pandas - python

I have two stages in my data pipeline. The first stage reads data from the source and dumps it to an intermediate S3 bucket, and the next stage reads data from this intermediate bucket. I have Athena set up on the intermediate stage, and we are planning to read this partitioned data through Athena rather than reading the files directly (reason for using Athena: we might have scenarios where a single read needs to pull from different partitions based on some condition).
Should we go ahead with this approach, given that Athena has some limitations when reading data into a pandas DataFrame, such as fetching results only 1,000 records at a time?
Is there a better solution for this use case? We are using pandas.

We have decided to use AWS Data Wrangler (awswrangler) for our purposes, since it is more reliable and is built for exactly what we are trying to achieve.
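A minimal sketch of the kind of read we have in mind with awswrangler (the database, table, and partition-column names below are placeholders for illustration):

import awswrangler as wr

# Query partitioned data through Athena straight into a pandas DataFrame;
# which partitions get read is driven by the WHERE clause.
df = wr.athena.read_sql_query(
    sql="SELECT * FROM my_table WHERE dt IN ('2021-01-01', '2021-01-02')",
    database="my_database",
)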

Related

Time proficient and cost effective solution for TBs of data transformation from S3

I have TBs of data in an S3 bucket stored as text files, and I want to transform them to JSON using a few custom scripts written in Python, then put the results back into another S3 bucket from where Athena can read them.
As of now I am using Python, AWS Glue Jobs and PySpark to do the transformation, and transforming 1 GB of data takes 3-4 minutes. I feel that is a little high and also not cost effective. These transformations also have to be done every quarter.
I am seeking AWS ETL experts' opinions here on whether the chosen services are a good fit, and if not, what I should use instead, keeping in mind that the transformed data should be available in Athena for analysis purposes. Please suggest cost-effective and time-efficient solutions.
Thanks.
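For reference, the current setup roughly amounts to a Glue/PySpark job along these lines (a rough sketch only; the bucket paths and the parsing step are placeholders, not the actual custom scripts):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, split

spark = SparkSession.builder.appName("text-to-json").getOrCreate()

# Read the raw text files from the source bucket
raw = spark.read.text("s3://source-bucket/raw/")

# Placeholder transformation: split each line on a tab delimiter
parsed = raw.select(split(col("value"), "\t").alias("fields"))

# Write the result as JSON to the bucket Athena reads from
parsed.write.mode("overwrite").json("s3://target-bucket/transformed/")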

How can I optimize file I/O in Python when I process GB-sized files via NFS?

I'm manipulating several files via NFS due to security concerns. The situation is very painful because of the slow file I/O. The following describes the issue.
I use pandas in Python to do simple processing on the data, so I use read_csv() and to_csv() frequently.
Currently, writing a 10 GB CSV file takes nearly 30 minutes, whereas reading takes 2 minutes.
I have enough CPU cores (> 20 cores) and memory (50-100 GB).
It is hard to ask for more network bandwidth.
I need to access the data in a column-oriented manner, frequently. For example, there would be 100M records with 20 columns (most of them numeric). For that data, I frequently read all 100M records but only 3-4 columns' values.
I've tried HDF5, but it produces a larger file and takes a similar time to write, and it does not provide column-oriented I/O, so I've discarded this option.
I cannot store the files locally; that would violate many security criteria. I'm actually working on a virtual machine and the file system is mounted via NFS.
I repeatedly read several columns (though not always the same ones). The task is something like data analysis.
Which approaches can I consider?
In several cases, I use sqlite3 to manipulate data in a simple way and export the results into CSV files. Can I accelerate I/O tasks by using sqlite3 in Python? If it provides column-wise operations, it would be a good solution, I reckon.
Two options: pandas HDF5 or dask.
You can look at the HDF5 format with format='table':
HDFStore supports another PyTables format on disk, the table format.
Conceptually a table is shaped very much like a DataFrame, with rows
and columns. A table may be appended to in the same or other sessions.
In addition, delete and query type operations are supported. This
format is specified by format='table' or format='t' to append or put
or to_hdf.
You can use dask's read_csv. It reads the data only when you call .compute().
Purely for improving I/O performance, I think HDF5 with compression is best.
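A rough sketch of both suggestions (the file names and columns are made up for illustration):

import pandas as pd
import dask.dataframe as dd

# toy frame standing in for the large CSV
df = pd.DataFrame({"col_a": range(1000), "col_b": range(1000), "col_c": range(1000)})
df.to_csv("data.csv", index=False)

# Option 1: HDF5 in 'table' format with compression; the table format lets you
# read back only the columns you need.
df.to_hdf("data.h5", key="data", format="table", complib="blosc", complevel=9)
subset = pd.read_hdf("data.h5", key="data", columns=["col_a", "col_b"])

# Option 2: dask builds the work lazily and only reads the CSV when you
# call .compute(); usecols restricts the read to the columns you want.
ddf = dd.read_csv("data.csv", usecols=["col_a", "col_b"])
subset = ddf.compute()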

Optimal way to store data from Pandas to Snowflake

The DataFrame is huge (7-8 million rows). I tried to_sql with chunksize=5000 but it never finished.
Using:
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL
df.to_sql(snowflake_table, engine, if_exists='replace', index=False, index_label=None, chunksize=20000)
What are other optimal solutions for storing data into SF from Pandas DF? Or what am I doing wrong here? The DF is usually of size 7-10 million rows.
The least painful way I can imagine is to dump the file to S3 and have Snowpipe load it into Snowflake automatically. With that set up you don't have to execute any copy command or make any Snowflake calls at all.
Refer to Snowflake documentation for details on how to set up Snowpipe for S3. In short you need to create a stage, a target table, a file format (I guess you already have these things in place though) and a pipe. Then set up SQS notifications for your bucket that the pipe will listen to.
Snowflake suggests having files sized around 10-100 MB, so it is likely a good idea to split the file.
import numpy as np
import s3fs

# set up credentials (s3fs is built on boto, hence this is AWS specific)
fs = s3fs.S3FileSystem(key=key, secret=secret)
# number of files to split into
n_chunks = 2
# loop over the dataframe and dump it chunk by chunk to S3
# (you likely want to expand the file naming logic to avoid overwriting existing files)
for f_name, chunks in enumerate(np.array_split(np.arange(df.shape[0]), n_chunks)):
    bytes_to_write = df.iloc[chunks].to_csv(index=False).encode()
    with fs.open('s3://mybucket/test/dummy_{}.csv'.format(f_name), 'wb') as f:
        f.write(bytes_to_write)
For reference, I tried this with a 7M-row dataframe split into 5 files of around 40 MB each. It took around 3 minutes and 40 seconds from starting to split the dataframe until all rows had arrived in Snowflake.
The optimal way, as ilja-everila pointed out, is "copy into ...". Since Snowflake requires the CSV to be staged in cloud storage before the transformation, I was hesitant to do it, but it seems like that is the only option, given that the performance is 5-10 minutes for 6.5 million records.
For the SQLAlchemy approach, could you also add, in the connection parameters, paramstyle=qmark so that data is bound as parameters. This is also referenced here: https://github.com/snowflakedb/snowflake-connector-python/issues/37#issuecomment-365503841
After this change, if you feel it appropriate, it may be a good idea to do a performance comparison between the SQLAlchemy approach and the bulk-load approach of writing the large DF to files and using COPY INTO to load the files into the Snowflake table.
pandas does an 'INSERT INTO ...' with multiple values behind the scenes. Snowflake has a restriction of up to 16,384 records per ingestion. Please change your chunksize to 16384.
Snowflake provides the write_pandas and pd_writer helper functions to manage that:
from snowflake.connector.pandas_tools import pd_writer
df.to_sql(snowflake_table, engine, index=False, method=pd_writer)
# ^ here
The pd_writer() function uses write_pandas():
write_pandas(): Writes a Pandas DataFrame to a table in a Snowflake database
To write the data to the table, the function saves the data to Parquet files, uses the PUT command to upload these files to a temporary stage, and uses the COPY INTO command to copy the data from the files to the table.
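A minimal sketch of calling write_pandas() directly, assuming an existing snowflake.connector connection conn and an existing target table (the table name here is a placeholder):

from snowflake.connector.pandas_tools import write_pandas

# conn is an existing snowflake.connector connection; MY_TABLE must already exist
success, nchunks, nrows, _ = write_pandas(conn, df, "MY_TABLE")
print(success, nchunks, nrows)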

Writing bulk data to BigQuery

I would like to write bulk data to BQ programmatically using the API.
My restrictions are:
I am going to use the max size of BQ: 10,000 columns and ~35,000 rows (this can get bigger)
Schema auto-detect is required
If possible, I would like to use some kind of parallelism to write many tables at the same time asynchronously (Apache Beam and Dataflow might be the solution for that)
When using the pandas library for BQ, there is a limit on the size of the dataframe that can be written; this requires partitioning of the data
What would be the best way to do so?
Many thanks for any advice / comment,
eilalan
Apache Beam would be the right component, as it supports huge-volume data processing in both batch and streaming modes.
I don't think Beam has schema auto-detect. But you can use the BigQuery API to fetch the schema if the table already exists.
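A minimal sketch of fetching an existing table's schema with the google-cloud-bigquery client, and of letting a load job auto-detect the schema (the project, dataset, table, and bucket names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()

# Fetch the schema of an existing table
table = client.get_table("my-project.my_dataset.my_table")
for field in table.schema:
    print(field.name, field.field_type)

# Or let BigQuery auto-detect the schema when loading new data
job_config = bigquery.LoadJobConfig(autodetect=True, source_format=bigquery.SourceFormat.CSV)
load_job = client.load_table_from_uri(
    "gs://my-bucket/data.csv", "my-project.my_dataset.new_table", job_config=job_config
)
load_job.result()  # wait for the load to finish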

Exporting BigQuery data for analysis using python

I am new to Google BigQuery so I'm trying to understand how to best accomplish my use case.
I have daily data of customer visits stored in BigQuery that I wish to analyse using some algorithms I have written in Python. Since there are multiple scripts that use subsets of the daily data, I was wondering what would be the best way to fetch and temporarily store the data. Additionally, the scripts run in a sequential manner: each script modifies some columns of the data and the subsequent script uses this modified data. After all the scripts have run, I want to store the modified data back to BigQuery.
Some approaches I had in mind are:
Export the BigQuery table into a GAE (Google App Engine) instance as a db file and query the relevant data for each script from the db file using the sqlite3 Python package. Once all the scripts have run, store the modified table back to BigQuery and then remove the db file from the GAE instance.
Query data from BigQuery every time I want to run a script, using the google-cloud Python client library or the pandas-gbq package. Modify the BigQuery table after running each script.
Could somebody advise which of these would be the better way to accomplish this (in terms of efficiency/cost), or suggest alternatives?
Thanks!
The answer to your question mostly depends on your use case and the size of the data that you will be processing, so there is no single correct answer.
However, there are some points that you may want to take into account regarding the usage of BigQuery and how some of its features can be interesting for you in the scenario you described.
Let me quickly go over the main topics you should have a look at:
Pricing: leaving aside the billing of storage and focusing on the cost of the queries themselves (which is more relevant to your use case), BigQuery billing is based on the number of bytes processed by each query. There is a 1 TB free quota per month, and beyond that the cost is $5 per TB of processed data, with 10 MB as the minimum measurable unit.
Cache: when BigQuery returns some information, it is stored in a temporary cached table (or a permanent one if you wish), and cached results are maintained for approximately 24 hours, with some exceptions described in the documentation (caching is also best-effort, so earlier deletion may happen too). Results returned from a cached table are not billed (since, per the billing definition, the cost is based on the number of bytes processed, and accessing a cached table implies no processing is being done), as long as you run the exact same query. It is worth having a look at this feature, because from your sentence "Since there are multiple scripts that use subsets of the daily data", maybe (just guessing here) it applies to your use case: you could run a single query once and then retrieve the results multiple times from the cached version without having to store them anywhere else.
Partitions: BigQuery offers partitioned tables, which are individual tables divided into smaller segments by date, which will make it easier to query data daily as you require.
Speed: BigQuery offers a real-time analytics platform, so you will be able to perform fast queries retrieving the information you need, applying some initial processing that you can later use in your custom Python algorithms.
So, in general, I would say that there is no need for you to keep any other database with partial results apart from your BigQuery storage. In terms of resource and cost efficiency, BigQuery offers enough features for you to work with your data without having to deal with huge expenses or delays in data retrieval. However, again, this will ultimately depend on your use case and the amount of data you are storing and need to process simultaneously; but in general terms, I would just go with BigQuery on its own.
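As a small illustration of working directly against BigQuery from Python with the client library (the project, dataset, table, and column names are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT customer_id, visit_date, visits
    FROM `my-project.my_dataset.daily_visits`
    WHERE visit_date = CURRENT_DATE()
"""
# Repeating the exact same query can be served from BigQuery's cache at no cost
df = client.query(sql).to_dataframe()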
