Very slow processing in Apache-Beam for BigQuery operations - python

Using beam.io.WriteToBigQuery and beam.io.BigQuerySource
How large is the "very large dataset" that Apache Beam can't handle without partitioning?
The official website mentions:
If you are using the Beam SDK for Python, you might have import size quota issues if you write a very large dataset.
which is really confusing! I have 100,000 rows of data in one BigQuery table, and I don't think that is very large.
But I am facing very high latency when reading the data and writing it again to another table in BigQuery.
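For context, here is a minimal sketch of the kind of table-to-table copy pipeline described above. The project, bucket, and table names are placeholders, and this illustrates the BigQuerySource/WriteToBigQuery setup in general, not the asker's actual code:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder project, bucket, and table names for illustration only.
options = PipelineOptions(
    runner="DataflowRunner",          # or "DirectRunner" for local tests
    project="my-project",
    temp_location="gs://my-bucket/tmp",
    region="us-central1",
)

with beam.Pipeline(options=options) as p:
    (
        p
        # BigQuerySource reads via a BigQuery export job; on Dataflow the
        # read is split across workers, so 100,000 rows is small for it.
        | "Read" >> beam.io.Read(beam.io.BigQuerySource(
            query="SELECT * FROM `my-project.my_dataset.source_table`",
            use_standard_sql=True))
        # Assumes the destination table already exists (CREATE_NEVER),
        # so no schema needs to be supplied here.
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.dest_table",
            write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER)
    )

Note that with this setup, much of the wall-clock time typically goes to the BigQuery export/load jobs scheduled under the hood and to Dataflow worker startup, rather than to the row count itself.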

Related

Best approach to extract 1000 GBs of data from Teradata

What would be the best approach to download data from Teradata using Python? I have more than 3000 GBs of data to be migrated from Teradata to another service. Any approach or advice would be great.
So far I have explored -
Using Pandas: it is slow and inefficient. I have also used the {fn teradata_try_fastexport} escape, but it seems to make no difference (see the sketch after this question).
Using tdload: I am not able to partition the file, and there is no support for downloading files with BYTE, CLOB, or BLOB data types.
Using Dask: I am not able to connect to Teradata and fetch the data with a Dask dataframe. I have read the SQL results using Pandas and then used Dask to write the data, but that seems to be of little help given the huge amount of data.
PySpark: I am exploring it, but it seems difficult, with a lot of configuration and dependency issues.
Any advice or a suggested approach that I should follow for efficient and effective data extraction?
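For what it's worth, here is a minimal sketch of chunked extraction with the teradatasql driver and its {fn teradata_try_fastexport} escape, writing each chunk to Parquet so the full dataset never has to fit in memory. The connection details, table name, and chunk size are placeholders:

import teradatasql
import pandas as pd

# Hypothetical connection details and table name; adjust for your system.
with teradatasql.connect(host="tdhost", user="user", password="password") as con:
    with con.cursor() as cur:
        # The {fn teradata_try_fastexport} escape asks the driver to use the
        # FastExport wire protocol when the query is eligible.
        cur.execute("{fn teradata_try_fastexport}SELECT * FROM mydb.big_table")
        part = 0
        while True:
            rows = cur.fetchmany(500_000)  # stream in fixed-size chunks
            if not rows:
                break
            cols = [d[0] for d in cur.description]
            # Write each chunk to its own Parquet part file (needs pyarrow).
            pd.DataFrame(rows, columns=cols).to_parquet(f"part-{part:05d}.parquet")
            part += 1

fetchmany keeps memory bounded, and, as I understand it, the "try" in the escape means the driver silently falls back to the regular protocol when the query is not eligible for FastExport.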

Best python friendly database to use for 2 billion records

Looking for recommendations for a fast file-based database to store some data I'll be loading into data tables in python3 pandas. Trying to avoid full systems like PostgreSQL, MySQL, MSSQL, etc. due to the extra daemon setup. Ideally just Python scripts and data files loading from a dedicated top-tier NVMe SSD.
Will only have a single table with under ten columns but 2 billion records.
Python will regularly be reading through every row.
Have a look at Vaex. For best performance you need your data in HDF5; for interoperability, Apache Arrow; and for optimized disk space and faster network I/O, Apache Parquet.
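As a rough sketch (the file names are placeholders), converting the data once to HDF5 and then memory-mapping it with Vaex might look like:

import vaex

# One-time conversion: vaex streams the CSV in chunks and writes HDF5,
# so the full 2-billion-row table never has to fit in memory.
df = vaex.from_csv("records.csv", convert="records.hdf5", chunk_size=5_000_000)

# Subsequent runs memory-map the HDF5 file; opening is near-instant.
df = vaex.open("records.hdf5")
print(df.count(), df.column_names)

Because Vaex memory-maps the HDF5 file, full scans are bounded by SSD read throughput rather than RAM, which fits the "regularly reading through every row" pattern.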

Get a massive csv file from GCS to BQ

I have a very large CSV file (let's say 1TB) that I need to get from GCS into BQ. While BQ does have a CSV loader, the CSV files I have are pretty non-standard and don't load properly into BQ without reformatting.
Normally I would download the CSV file onto a server to 'process it' and save it either directly to BQ or to an Avro file that can be ingested easily by BQ. However, the file(s) are quite large, and it's quite possible (and probable) that I wouldn't have the storage/memory to do the batch processing without writing a lot of code to optimize/stream it.
Is this a good use case for Cloud Dataflow? Are there any tutorials or ways to go about getting a file of format "X" from GCS into BQ? Any tutorial pointers or example scripts would be great.
I'd personally use Dataflow (not Dataprep) and write a simple pipeline to read the file in parallel, clean/transform it, and finally write it to BigQuery. It's pretty straightforward. Here's an example of one in my GitHub repo. Although it's in Java, you could easily port it to Python. Note: it uses the "templates" feature in Dataflow, but this can be changed with one line of code.
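As an illustration only (not the code from that repo), a Python pipeline of this shape might look like the following, where the bucket, table, delimiter, and parse_line cleanup logic are all hypothetical stand-ins:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_line(line):
    # Hypothetical cleanup: split on the file's non-standard delimiter and
    # map the pieces to the destination schema. Assumes no header row.
    parts = line.split("|~|")
    return {"id": int(parts[0]), "name": parts[1], "value": float(parts[2])}

with beam.Pipeline(options=PipelineOptions()) as p:
    (
        p
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/huge-file.csv")
        | "Parse" >> beam.Map(parse_line)
        | "Write" >> beam.io.WriteToBigQuery(
            "my-project:my_dataset.my_table",
            schema="id:INTEGER,name:STRING,value:FLOAT",
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED)
    )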
If Dataflow is off the table, another option could be to use a weird/unused delimiter and read the entire row into BigQuery. Then use SQL/Regex/UDFs to clean/transform/parse it. See here (suggestion from Felipe). We've done this lots of times in the past, and because you're in BigQuery it scales really well.
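A rough sketch of that second trick, driven from Python; the table names, the single raw column, and the regexes are hypothetical:

from google.cloud import bigquery

client = bigquery.Client()

# Assumes the file was loaded with an unused delimiter so that each line
# landed intact in a single STRING column called `raw`.
sql = """
CREATE OR REPLACE TABLE `my-project.my_dataset.clean` AS
SELECT
  REGEXP_EXTRACT(raw, r'^([^,]*)')        AS id,
  REGEXP_EXTRACT(raw, r'^[^,]*,([^,]*)')  AS name
FROM `my-project.my_dataset.raw_rows`
"""
client.query(sql).result()  # the parsing runs entirely inside BigQuery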
I would consider using Cloud Dataprep.
Dataprep can import data from GCS, clean / modify the data and export to BigQuery. One of the features that I like is that everything can be done visually / interactively so that I can see how the data transforms.
Start with a subset of your data to see what transformations are required and to give yourself some practice before loading and processing a TB of data.
You can always transfer from a storage bucket directly into a BQ table:
bq --location=US load --[no]replace --source_format=CSV dataset.table gs://bucket/file.csv [schema]
Here, [schema] can be an inline schema of your csv file (like id:int,name:string,..) or a path to a JSON schema file (available locally).
As per the BQ documentation, BigQuery tries to parallelize large CSV loads into tables. Of course, there is an upper bound involved: the maximum size of an uncompressed CSV file to be loaded from GCS to BQ is 5TB, which is way above your requirements. I think you should be good with this.
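If you would rather drive the same load job from Python than from the bq CLI, a sketch with the google-cloud-bigquery client (bucket and table names are placeholders) could look like:

from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,   # skip the header row, if any
    autodetect=True,       # or pass an explicit schema instead
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# The load runs entirely inside BigQuery; nothing is downloaded locally.
load_job = client.load_table_from_uri(
    "gs://bucket/file.csv", "my-project.dataset.table", job_config=job_config)
load_job.result()  # wait for completion
print(client.get_table("my-project.dataset.table").num_rows)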

Writing bulk data to BigQuery

I would like to write bulk data to BQ using a software API.
My restrictions are:
I am going to use the maximum BQ table width of 10,000 columns, with ~35,000 rows (this can be bigger)
Schema autodetect is required
If possible, I would like to use some kind of parallelism to write many tables at the same time asynchronously (Apache Beam and Dataflow might be the solution for that)
When using the Pandas library for BQ, there is a limit on the size of the dataframe that can be written; this requires partitioning of the data
What would be the best way to do so?
Many thanks for any advice / comment,
eilalan
Apache Beam would be the right component, as it supports huge-volume data processing in both batch and streaming modes.
I don't think Beam has schema auto-detect. But you can use the BigQuery API to fetch the schema if the table already exists, as sketched below.
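A minimal sketch of that idea, assuming a placeholder table name: fetch an existing table's schema with the google-cloud-bigquery client and convert it into the dict form that beam.io.WriteToBigQuery accepts:

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.my_dataset.existing_table")  # placeholder

# Convert the client library's SchemaField objects into the
# {'fields': [...]} dict format that beam.io.WriteToBigQuery accepts.
beam_schema = {
    "fields": [
        {"name": f.name, "type": f.field_type, "mode": f.mode}
        for f in table.schema
    ]
}
print(beam_schema)

The resulting beam_schema dict can then be passed as the schema argument of beam.io.WriteToBigQuery.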

Exporting BigQuery data for analysis using python

I am new to Google BigQuery so I'm trying to understand how to best accomplish my use case.
I have daily data of customer visits stored in BigQuery that I wish to analyse using some algorithms I have written in Python. Since there are multiple scripts that use subsets of the daily data, I was wondering what would be the best way to fetch and temporarily store the data. Additionally, the scripts run sequentially: each script modifies some columns of the data, and the subsequent script uses this modified data. After all the scripts have run, I want to store the modified data back to BigQuery.
Some approaches I had in mind are:
Export the bigquery table into a GAE (Google App Engine) instance as a db file and query the relevant data for each script from the db file using sqlite3 python package. Once, all the scripts have run, store the modified table back to BigQuery and then remove the db file from the GAE instance.
Query data from BigQuery every time I want to run a script using the google-cloud python client library or pandas gbq package. Modify the BigQuery table after running each script.
Does anybody know which of these would be the better way to accomplish this (in terms of efficiency/cost), or can you suggest alternatives?
Thanks!
The answer to your question mostly depends on your use case and the size of the data that you will be processing, so there is not an absolute and correct answer for it.
However, there are some points that you may want to take into account regarding the usage of BigQuery and how some of its features can be interesting for you in the scenario you described.
Let me quickly go over the main topics you should have a look at:
Pricing: leaving aside the billing of storage and focusing on the cost of the queries themselves (which is more related to your use case), BigQuery billing is based on the number of bytes processed by each query. There is a 1TB free quota per month; from then on, the cost is $5 per TB of processed data, with 10MB of data being the minimum measurable unit.
Cache: when BigQuery returns some information, it is stored in a temporary cached table (or a permanent one if you wish), maintained for approximately 24 hours, with some exceptions that you may find in this same documentation link (caching is also best-effort, so earlier deletion may happen too). Results returned from a cached table are not billed, as long as you are running the exact same query: since billing is based on the number of bytes processed, reading a cached table involves no processing at all. I think this feature is worth a look, because from your sentence "Since there are multiple scripts that use subsets of the daily data", it may apply to your use case (just guessing here) to run a single query once and then retrieve the results multiple times from the cached version, without having to store them anywhere else (see the small sketch after this list).
Partitions: BigQuery offers the concept of partitioned tables, which are individual tables divided into smaller segments by date, which will make it easier to query data daily as you require.
Speed: BigQuery offers a real-time analytics platform, so you will be able to perform fast queries retrieving the information you need, applying some initial processing that you can later use in your custom Python algorithms.
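As a small illustration of the cache point above (the query and table names are placeholders), the Python client exposes whether a query was served from cache:

from google.cloud import bigquery

client = bigquery.Client()
sql = "SELECT customer_id, visits FROM `my-project.my_dataset.daily_visits`"

job1 = client.query(sql)
df = job1.to_dataframe()   # first run: bytes are processed and billed
job2 = client.query(sql)   # identical query text, within ~24 hours
job2.result()
print(job2.cache_hit)      # True -> served from cache, not billed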
So, in general, I would say that there is no need for you to keep any other database with partial results apart from your BigQuery storage. In terms of resource and cost efficiency, BigQuery offers enough features for you to work with your data locally without having to deal with huge expenses or delays in data retrieval. However, again, this will ultimately depend on your use case and the amount of data you are storing and need to process simultaneously; but in general terms, I would just go with BigQuery on its own.
