Performance issues when iterating over result set - python

We use the google-cloud-bigquery Python library to query BigQuery and process the results in our Python script. The processing portion transforms and enriches the data and ultimately produces JSON objects.
This is how we use the BQ library in our script (simplified):
from google.cloud import bigquery

client = bigquery.Client()
query = "SELECT col1, col2, ... FROM <table>"
queryjob = client.query(query)

# Fetch the results in pages of 50,000 rows and process each page
result_set = queryjob.result(page_size=50000)
for page in result_set.pages:
    transform_records(page)
For moderately sized tables this works just fine. However, we run into a performance issue when querying a table that returns 11 million records (~3.5 GB in total). Even if we leave out the processing, just fetching the pages takes about 80 minutes (we did not observe any significant difference between running locally and running in a VM / cluster in the same region as the BigQuery dataset).
Any ideas on how to reduce the loading time?
What we tried:
Varying the page size: The obvious assumption that larger pages (and therefore fewer pages) reduce HTTP overhead holds true. However, we noticed that setting the page size above ~8,500 had no further effect (the API never returned more than ~8,500 records per page). Even so, this only accounts for an improvement of a few percent in loading time.
Iterating over the individual records of the result set instead of over pages: gave us roughly the same performance.
Separating data loading from processing by moving the loading into a background thread and sharing the data with the processing workers via a multiprocessing queue: obviously no impact on the pure time spent receiving the data from BQ.
Trying to fetch multiple pages in parallel: we think this could reduce the loading time drastically, but we did not manage to get it working.
What we did not try:
Using the BQ Storage API, or rather a method that fetches data from BQ via this API (i.e. result_set.to_arrow_iterable / to_dataframe_iterable): we would like to avoid the mess of dealing with data type conversions, as the output of the processing part will be a JSON object (a sketch of this route follows below the list).
Using the BQ REST API directly, without the comfort that the bigquery library offers, in order to fetch multiple pages of the result set simultaneously: this seems somewhat complicated, and we are not even sure whether the API itself allows pages to be accessed simultaneously.
Exporting the data to GCS first using the client.extract_table method: we have used this approach in other use cases and are aware that fetching data from GCS is much faster. However, since we get acceptable performance for most of our source tables, we would rather avoid the extra step of exporting to GCS.
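For completeness, here is a minimal sketch of what the to_arrow_iterable route mentioned above could look like, assuming the google-cloud-bigquery-storage package is installed. Arrow record batches can be converted straight back to plain Python dicts, which keeps the type-conversion work small; transform_records and the query are placeholders from the question.

from google.cloud import bigquery
from google.cloud import bigquery_storage

client = bigquery.Client()
bqstorage_client = bigquery_storage.BigQueryReadClient()

queryjob = client.query("SELECT col1, col2, ... FROM <table>")
result_set = queryjob.result()

# to_arrow_iterable() streams Arrow record batches over the Storage API
# instead of paging through the REST API.
for record_batch in result_set.to_arrow_iterable(bqstorage_client=bqstorage_client):
    rows = record_batch.to_pylist()   # one dict per row, ready for JSON serialization
    transform_records(rows)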

Given the size of the data, the approach you mention should be avoided. One of the following approaches can be applied instead:
Transform the table data in BigQuery using built-in functions, UDFs, or Remote Functions, and save the transformed data to another table.
Export the transformed table data to Cloud Storage as one or more CSV or JSON files.
Load the CSV / JSON files into the non-GCP system using a compute service.
If the transformation is not feasible in BigQuery, then:
Export the raw table data to Cloud Storage as one or more CSV or JSON files.
Load each CSV / JSON file on a compute service, transform the data, and load the transformed data into the non-GCP system. (A sketch of the export step follows below.)
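A minimal sketch of the export step with the Python client, assuming a hypothetical bucket and table and newline-delimited JSON as the output format:

from google.cloud import bigquery

client = bigquery.Client()

# Export the (transformed) table to GCS as sharded newline-delimited JSON files.
# The table reference and "my-bucket" are placeholders.
job_config = bigquery.ExtractJobConfig(
    destination_format=bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
)
extract_job = client.extract_table(
    "my-project.my_dataset.my_transformed_table",
    "gs://my-bucket/export/part-*.json",
    job_config=job_config,
)
extract_job.result()  # wait for the export job to finish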

Related

How to populate an AWS Timestream DB?

I am trying to use AWS Timestream to store timestamped data (in Python, using boto3).
The data I need to store corresponds to prices of different tokens over time. Each record has 3 fields: token_address, timestamp, price. I have around 100 million records (with timestamps from 2019 to now).
I have all the data in a CSV and would like to populate the DB with it, but I cannot find a way to do this in the documentation, as quotas limit me to 100 records per write request. The only optimization proposed in the documentation is writing batches of records with common attributes, but in my case the records do not share the same values (they all have the same structure but not the same values, so I cannot define common_attributes as they do in the example).
So, is there a way to populate a Timestream DB without writing records in batches of 100?
I asked AWS support; here is their answer:
Unfortunately, "Records per WriteRecords API request" is a non-configurable limit. This limitation is already noted by the development team.
However, to get any additional insights to help with your load, I have reached out to my internal team. I will get back to you as soon as I have an update from the team.
EDIT:
I received a new answer from AWS support:
The team suggested that a new feature called batch load is tentatively being released at the end of February (2023). This feature will allow customers to ingest data from CSV files directly into Timestream in bulk.
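Until batch load is available, the usual workaround is to chunk the CSV into groups of 100 records and call WriteRecords repeatedly (optionally from several threads). A minimal sketch, assuming hypothetical database, table, and file names and that the CSV stores Unix epoch seconds:

import csv
import boto3

client = boto3.client("timestream-write")

def to_record(row):
    # Column names taken from the question; everything else is an assumption.
    return {
        "Dimensions": [{"Name": "token_address", "Value": row["token_address"]}],
        "MeasureName": "price",
        "MeasureValue": row["price"],
        "MeasureValueType": "DOUBLE",
        "Time": row["timestamp"],   # assumed to be epoch time
        "TimeUnit": "SECONDS",      # adjust if the CSV stores milliseconds
    }

with open("prices.csv", newline="") as f:   # "prices.csv" is a placeholder
    batch = []
    for row in csv.DictReader(f):
        batch.append(to_record(row))
        if len(batch) == 100:               # quota: at most 100 records per call
            client.write_records(DatabaseName="my_db", TableName="prices", Records=batch)
            batch = []
    if batch:
        client.write_records(DatabaseName="my_db", TableName="prices", Records=batch)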

Export Bigquery table to gcs bucket into multiple folders/files corresponding to clusters

Due to loading time and query cost, I need to export a BigQuery table to multiple Google Cloud Storage folders within a bucket.
I currently use ExtractJobConfig from the BigQuery Python client with the wildcard operator to create multiple files. But I need to create a folder for every nomenclature value (it is a column of the BigQuery table), and then create the multiple files inside it.
The table is pretty huge, 1+ TB, and will not fit in RAM (it could, but that is not the idea), so I cannot simply loop over it in Python.
I have read quite a lot of documentation and gone through the parameters, but I cannot find a clean solution. Did I miss something, or is there no Google solution?
My plan B is to use Apache Beam and Dataflow, but I do not have the skills yet, and I would like to avoid this solution as much as possible for the sake of simplicity and maintenance.
You have 2 solutions:
Create one export query per aggregation. If you have 100 nomenclature values, query the table 100 times and export the data to the corresponding target directory each time. The issue is the cost: you will pay for processing the table 100 times. (This option is sketched below.)
Use Apache Beam to extract and partition the data. Then, with a dynamic destination, you can create all the GCS paths you want. The issue is that this requires Apache Beam skills.
There is an extra option, similar to the second one: use Spark, and in particular serverless Spark, to achieve it. If you have more Spark than Apache Beam skills, it could be more efficient.
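A minimal sketch of the first option, running one EXPORT DATA statement per nomenclature value; the project, dataset, table, bucket, and column names are placeholders, not taken from the question:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names throughout: adjust project, dataset, table, bucket and column.
values = [
    row["nomenclature"]
    for row in client.query(
        "SELECT DISTINCT nomenclature FROM `my-project.my_dataset.my_table`"
    ).result()
]

for value in values:
    # One export job per nomenclature value; each job scans the whole table,
    # which is exactly the cost issue mentioned above.
    export_sql = f"""
        EXPORT DATA OPTIONS(
            uri='gs://my-bucket/{value}/part-*.csv',
            format='CSV',
            overwrite=true,
            header=true
        ) AS
        SELECT * FROM `my-project.my_dataset.my_table`
        WHERE nomenclature = '{value}'
    """
    client.query(export_sql).result()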

How to fetch data as a .zip using Cx_Oracle?

I would like to fetch the data but receive a .zip with all the data instead of a list of tuples. That is, the client makes the specified query, the database server compresses the result data into a .zip, and then sends this .zip to the client.
By doing this I expect to greatly reduce the time spent sending the data, because there are lots of repeated fields.
I know Oracle's Advanced Compression exists; however, I am not able to achieve this using cx_Oracle.
Any help / workaround is appreciated.
Advanced Network Compression can be enabled as described here, using sqlnet.ora and/or tnsnames.ora:
https://cx-oracle.readthedocs.io/en/latest/user_guide/initialization.html#optnetfiles
https://www.oracle.com/technetwork/database/enterprise-edition/advancednetworkcompression-2141325.pdf
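On the client side, cx_Oracle only needs to be pointed at the configuration directory that holds the sqlnet.ora; the directory path, credentials, and query below are placeholders, and the parameters shown in the comments are the standard Advanced Network Compression settings:

import cx_Oracle

# Point the Oracle client at a directory containing sqlnet.ora (placeholder path).
# A minimal sqlnet.ora enabling compression could contain:
#   SQLNET.COMPRESSION=on
#   SQLNET.COMPRESSION_LEVELS=(high)
cx_Oracle.init_oracle_client(config_dir="/opt/oracle/config")

connection = cx_Oracle.connect(user="scott", password="tiger", dsn="dbhost/orclpdb1")
with connection.cursor() as cursor:
    cursor.execute("SELECT * FROM my_table")   # placeholder query
    rows = cursor.fetchall()                   # compression happens transparently on the wire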

Python REST API using dynamically-loaded persistent data

I am trying to build a REST API in Python that relies on large data being loaded dynamically into memory and processed.
After a request to the API, I would like to load the useful data (e.g., read from disk or from a DB) and keep it in memory because other requests relying on the same data should follow. After some time, I would need to drop the data in order to save memory.
In practice, I would like to keep a list of pandas DataFrames in memory. The DataFrames in the list would be those needed to fulfill the latest requests. Some DataFrames can be very large (e.g., several GB), so I do not think I can afford to retrieve them from a DB on every request without a big overhead. This is why I want to keep them in memory for subsequent requests.
I started with Flask when the API relied on a single, fixed DataFrame. But now I cannot find a way to load new DataFrames dynamically and keep them persistent across multiple requests. The loading of a new DataFrame should be triggered inside a request when necessary, and the new DataFrame should be available to the following requests. I do not know how to achieve this with Flask or with any other framework.
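One common pattern is a small process-level cache of DataFrames keyed by name, protected by a lock and bounded by a crude LRU eviction rule, as sketched below; load_dataframe_from_db and MAX_CACHED are hypothetical, and this only behaves as expected with a single worker process (each gunicorn/uwsgi worker would get its own copy).

import threading
from collections import OrderedDict

import pandas as pd
from flask import Flask, jsonify

app = Flask(__name__)

MAX_CACHED = 3              # keep at most 3 DataFrames in memory (assumption)
_cache = OrderedDict()      # dataset name -> DataFrame, in least-recently-used order
_lock = threading.Lock()

def load_dataframe_from_db(name):
    # Hypothetical loader: replace with the real disk / DB read.
    return pd.DataFrame({"value": range(10)})

def get_dataframe(name):
    with _lock:
        if name in _cache:
            _cache.move_to_end(name)    # mark as most recently used
            return _cache[name]
    df = load_dataframe_from_db(name)   # load outside the lock; this may be slow
    with _lock:
        _cache[name] = df
        _cache.move_to_end(name)
        while len(_cache) > MAX_CACHED: # evict the least recently used frames
            _cache.popitem(last=False)
    return df

@app.route("/datasets/<name>/summary")
def summary(name):
    df = get_dataframe(name)
    return jsonify(df.describe().to_dict())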

Storing pandas DataFrames in SQLAlchemy models

I'm building a flask application that allows users to upload CSV files (with varying columns), preview uploaded files, generate summary statistics, perform complex transformations/aggregations (sometimes via Celery jobs), and then export the modified data. The uploaded file is being read into a pandas DataFrame, which allows me to elegantly handle most of the complicated data work.
I'd like these DataFrames along with associated metadata (time uploaded, ID of user uploading the file, etc.) to persist and be available for multiple users to pass around to various views. However, I'm not sure how best to incorporate the data into my SQLAlchemy models (I'm using PostgreSQL on the backend).
Three approaches I've considered:
Cramming the DataFrame into a PickleType and storing it directly in the DB. This seems to be the most straightforward solution, but means I'll be sticking large binary objects into the database.
Pickling the DataFrame, writing it to the filesystem, and storing the path as a string in the model. This keeps the database small but adds some complexity when backing up the database and allowing users to do things like delete previously uploaded files.
Converting the DataFrame to JSON (DataFrame.to_json()) and storing it as a json type (maps to PostgreSQL's json type). This adds the overhead of parsing JSON each time the DataFrame is accessed, but it also allows the data to be manipulated directly via PostgreSQL JSON operators.
Given the advantages and drawbacks of each (including those I'm unaware of), is there a preferred way to incorporate pandas DataFrames into SQLAlchemy models?
Go with the JSON and PostgreSQL solution. I am on a pandas project that started with pickles on the file system and loaded the data into a class object for the data processing with pandas. However, as the data grew, we moved to SQLAlchemy / SQLite3. Now we are finding that working with SQLAlchemy / PostgreSQL is even better. I think our next step will be JSON.
Have fun! Pandas rocks!
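A minimal sketch of that JSON route, storing each DataFrame in a PostgreSQL JSONB column next to its metadata; the model, column names, and connection string are illustrative, and it assumes the frame's values are JSON-serializable:

import datetime

import pandas as pd
from sqlalchemy import Column, DateTime, Integer, String, create_engine
from sqlalchemy.dialects.postgresql import JSONB
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class Upload(Base):
    __tablename__ = "uploads"
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, nullable=False)
    uploaded_at = Column(DateTime, default=datetime.datetime.utcnow)
    filename = Column(String)
    data = Column(JSONB)    # one JSON object per DataFrame row

engine = create_engine("postgresql+psycopg2://user:pass@localhost/mydb")  # placeholder DSN
Base.metadata.create_all(engine)

def save_dataframe(df, user_id, filename):
    with Session(engine) as session:
        upload = Upload(user_id=user_id, filename=filename,
                        data=df.to_dict(orient="records"))
        session.add(upload)
        session.commit()
        return upload.id

def load_dataframe(upload_id):
    with Session(engine) as session:
        upload = session.get(Upload, upload_id)
        return pd.DataFrame(upload.data)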
