Get a massive csv file from GCS to BQ - python

I have a very large CSV file (let's say 1TB) that I need to get from GCS into BQ. While BQ does have a CSV loader, the CSV files that I have are pretty non-standard and don't load properly into BQ without being reformatted first.
Normally I would download the csv file onto a server to 'process it' and save it either directly to BQ or to an avro file that can be ingested easily by BQ. However, the file(s) are quite large and it's quite possible (and probable) that I wouldn't have the storage/memory to do the batch processing without writing a lot of code to optimize/stream it.
Is this a good use case for using Cloud Dataflow? Are there any tutorials or ways to go about getting a file of format "X" from GCS into BQ? Any tutorial pointers or example scripts would be great.

I'd personally use Dataflow (not Dataprep) and write a simple pipeline to read the file in parallel, clean/transform it, and finally write it to BigQuery. It's pretty straightforward. Here's an example of one in my GitHub repo. Although it's in Java, you could easily port it to Python. Note: it uses the "templates" feature in Dataflow, but this can be changed with one line of code.
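For reference, here is a rough sketch of what such a pipeline can look like in Python; the project, bucket, table, schema, and the clean_row() logic are placeholders you would replace with your own format handling:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def clean_row(line):
    # Placeholder parsing: repair/split the non-standard CSV line and return a
    # dict whose keys match the schema given to WriteToBigQuery below.
    parts = line.split(',')
    return {'id': int(parts[0]), 'name': parts[1].strip()}

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',              # assumed project id
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
)

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/huge-file.csv', skip_header_lines=1)
     | 'Clean' >> beam.Map(clean_row)
     | 'Write' >> beam.io.WriteToBigQuery(
           'my-project:dataset.table',
           schema='id:INTEGER,name:STRING',
           create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
           write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE))

Because text files are splittable, Dataflow reads the 1TB file in parallel ranges, so it never has to fit on a single worker.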
If Dataflow is off the table, another option could be to use a weird/unused delimiter and read the entire row into BigQuery. Then use SQL/Regex/UDFs to clean/transform/parse it. See here (suggestion from Felipe). We've done this lots of times in the past, and because you're in BigQuery it scales really well.
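If you go that route, here is a rough sketch of the idea using the BigQuery Python client; the delimiter, table names, and the parsing SQL are all assumptions to adapt to your data:

from google.cloud import bigquery

client = bigquery.Client(project='my-project')   # assumed project id

# Load each line as a single STRING column by picking a delimiter that never
# appears in the data (the 'thorn' character is a common choice).
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter='\u00fe',
    quote_character='',
    schema=[bigquery.SchemaField('raw', 'STRING')],
)
client.load_table_from_uri(
    'gs://my-bucket/huge-file.csv',
    'my-project.dataset.raw_staging',
    job_config=job_config,
).result()

# Then clean/parse the raw rows with SQL (the regexes here are illustrative).
client.query("""
    CREATE OR REPLACE TABLE dataset.clean AS
    SELECT
      CAST(REGEXP_EXTRACT(raw, r'^([^,]+)') AS INT64) AS id,
      TRIM(REGEXP_EXTRACT(raw, r'^[^,]+,(.*)$')) AS name
    FROM dataset.raw_staging
""").result()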

I would consider using Cloud Dataprep.
Dataprep can import data from GCS, clean / modify the data and export to BigQuery. One of the features that I like is that everything can be done visually / interactively so that I can see how the data transforms.
Start with a subset of your data to see what transformations are required and to give yourself some practice before loading and processing a TB of data.

You can always transfer from a storage bucket directly into a BQ table:
bq --location=US load --[no]replace --source_format=CSV dataset.table gs://bucket/file.csv [schema]
Here, [schema] can be an inline schema of your csv file (like id:integer,name:string,...) or a path to a JSON schema file (available locally).
As per the BQ documentation, large CSV loads into tables are parallelized where possible. Of course, there is an upper bound involved: the maximum size of an uncompressed CSV file loaded from GCS into BQ is 5TB, which is way above your requirements. I think you should be good with this.
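Since the question mentions Python, the equivalent of that bq command with the BigQuery client library would look roughly like this (bucket, dataset, and schema are placeholders):

from google.cloud import bigquery

client = bigquery.Client(project='my-project')

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # like --replace
    schema=[
        bigquery.SchemaField('id', 'INTEGER'),
        bigquery.SchemaField('name', 'STRING'),
    ],
)

load_job = client.load_table_from_uri(
    'gs://bucket/file.csv',
    'my-project.dataset.table',
    job_config=job_config,
)
load_job.result()  # wait for the load to finish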

Related

Read columns from Parquet in GCS without reading the entire file?

Reading a parquet file from disk, I can choose to read only a few columns (I assume it scans the header/footer, then decides). Is it possible to do this remotely (such as via Google Cloud Storage)?
We have 100 MB parquet files with about 400 columns and we have a use-case where we want to read 3 of them, and show them to the user. The user can choose which columns.
Currently we download the entire file and then filter it, but this takes time.
Long term we will be putting it into Google BigQuery and the problem will be solved.
More specifically we use Python with either pandas or PyArrow and ideally would like to use those (either with a GCS backend or manually getting the specific data we need via a wrapper). This runs in Cloud Run so we would prefer to not use Fuse, although that is certainly possible.
I intend to use Python and pandas/pyarrow as the backend for this, running in Cloud Run (hence why data size matters, because a 100MB download to disk actually means 100MB downloaded to RAM).
We use pyarrow.parquet.read_table with to_pandas(), or pandas.read_parquet.
The pandas.read_parquet function has a columns argument to read only a subset of columns.
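A small sketch of both variants, assuming gcsfs is installed so that pandas/pyarrow can open gs:// paths (the bucket and column names are placeholders):

import pandas as pd
import pyarrow.parquet as pq
import gcsfs

cols = ['col_a', 'col_b', 'col_c']   # the 3 columns the user picked

# Via pandas (pyarrow engine + gcsfs under the hood):
df = pd.read_parquet('gs://my-bucket/data/file.parquet', columns=cols)

# Or via pyarrow directly; it reads the footer metadata and should only fetch
# the column chunks it needs rather than the whole 100 MB file:
fs = gcsfs.GCSFileSystem()
df = pq.read_table('my-bucket/data/file.parquet', columns=cols, filesystem=fs).to_pandas()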

How to export a huge table from BigQuery into a Google cloud bucket as one file

I am trying to export a huge table (2,000,000,000 rows, roughly 600GB in size) from BigQuery into a google bucket as a single file. All tools suggested in Google's Documentation are limited in export size and will create multiple files.
Is there a pythonic way to do it without needing to hold the entire table in memory?
While there may be other ways to script it, the recommended solution is to merge the exported files using the Google Cloud Storage compose action.
What you have to do is:
export in CSV format
this produces many files
run the compose action in batches of up to 32 files at a time, repeating until a single big merged file remains (see the sketch below)
All of this can be combined in a Cloud Workflow; there is a tutorial here.
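A rough Python sketch of the compose loop, assuming the export produced shards under a hypothetical export/part- prefix:

from google.cloud import storage

client = storage.Client(project='my-project')
bucket = client.bucket('my-bucket')

blobs = sorted(bucket.list_blobs(prefix='export/part-'), key=lambda b: b.name)

# compose() accepts at most 32 source objects per call, so merge in rounds
# until a single object remains.
round_num = 0
while len(blobs) > 1:
    next_round = []
    for i in range(0, len(blobs), 32):
        merged = bucket.blob('export/merged-%d-%05d.csv' % (round_num, i // 32))
        merged.compose(blobs[i:i + 32])
        next_round.append(merged)
    blobs = next_round
    round_num += 1

bucket.rename_blob(blobs[0], 'export/final.csv')

One thing to watch: if each exported shard contains its own header row, you may want to export without headers so the composed file does not repeat them.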

How to mark read files in dataflow?

I am using Dataflow to read files from a GCS bucket and do some transformations on them. I am using the beam.io.ReadFromText() method for that.
What is the best way to mark the files that have already been read, so that the same file is not read repeatedly by Dataflow?
A possible solution is to set up a cloud storage trigger which publishes the name of each file uploaded to the storage bucket as a separate PubSub message to the Topic of your choosing (i.e. projects/PROJECT_ID/topics/TOPIC_NAME).
You can then set up a streaming dataflow pipeline which ingests these PubSub messages via beam.io.ReadFromPubSub(topic='projects/PROJECT_ID/topics/TOPIC_NAME'), from which the filename can be extracted and the data from the file read using beam.io.ReadAllFromText(). You can then continue the pipeline with your own custom transformations.
This pattern negates the need to track the files which have already been transformed as each file is automatically transformed as soon as it is uploaded to the bucket.
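A minimal sketch of that streaming pattern; it assumes native Cloud Storage Pub/Sub notifications (which carry bucketId/objectId attributes), so adjust the message parsing to whatever your trigger actually publishes:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadNotifications' >> beam.io.ReadFromPubSub(
           topic='projects/PROJECT_ID/topics/TOPIC_NAME', with_attributes=True)
     | 'ToGcsPath' >> beam.Map(
           lambda msg: 'gs://%s/%s' % (msg.attributes['bucketId'],
                                       msg.attributes['objectId']))
     | 'ReadLines' >> beam.io.ReadAllFromText()
     | 'Transform' >> beam.Map(lambda line: line))  # your own transforms here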
I came across the following useful link which could assist with the details of implementing the above pattern (see the 'Streaming processing of GCS files' subsection). https://medium.com/@pavankumarkattamuri/input-source-reading-patterns-in-google-cloud-dataflow-4c1aeade6831
Hope this helps!
A Dataflow job using beam.io.ReadFromText will read each file that matches the given pattern exactly once. I assume from your question you're trying to run a pipeline multiple times and only read files that showed up in the GCS bucket since the last run? In that case, you have two options.
(1) Use apache_beam.io.textio.ReadFromTextWithFilename and then record the set of filenames that you already read somewhere (e.g. write them to a text file) that you consult when constructing the set of files to read on your next run, or
(2) Use apache_beam.io.textio.ReadAllFromText to read from a PCollection of filenames, which is computed to be the set of things that exist in your bucket (e.g. using apache_beam.io.fileio.MatchFiles) but were not read in any previous run (recorded as in (1) via a separate output file in GCS).
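For the batch route, here is a rough sketch of option (2); the bucket paths are placeholders, and it assumes the "already processed" log file exists (create an empty one before the first run):

import apache_beam as beam
from apache_beam.io import fileio

with beam.Pipeline() as p:
    already_read = (p
        | 'ReadLog' >> beam.io.ReadFromText('gs://my-bucket/state/processed_files.txt'))

    new_files = (p
        | 'MatchFiles' >> fileio.MatchFiles('gs://my-bucket/input/*.csv')
        | 'GetPaths' >> beam.Map(lambda metadata: metadata.path)
        | 'DropSeen' >> beam.Filter(
              lambda path, seen: path not in seen,
              seen=beam.pvalue.AsList(already_read)))

    lines = new_files | 'ReadNew' >> beam.io.ReadAllFromText()
    # ... your transformations on `lines` go here ...

    # Record everything handled so far so the next run can skip it.
    _ = ((already_read, new_files)
         | 'Union' >> beam.Flatten()
         | 'WriteLog' >> beam.io.WriteToText(
               'gs://my-bucket/state/processed_files',
               file_name_suffix='.txt', num_shards=1, shard_name_template=''))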
It might be worth considering if a streaming pipeline would better meet your needs.

Read snappy or lzo compressed files from DataFlow in Python

Is there a way to read snappy or lzo compressed files on DataFlow using Apache Beam's Python SDK?
Since I couldn't find an easier way, this is my current approach (which seems totally overkill and inefficient):
Start DataProc cluster
Uncompress such data using hive in the new cluster and place it in a temporary location
Stop DataProc cluster
Run the DataFlow job that reads from these temporary uncompressed data
Clean up temporary uncompressed data
I don't think there is any built-in way to do this today with Beam. The Python SDK supports gzip, bzip2, and deflate.
Option 1: Read in the whole files and decompress manually
Create a custom source to produce a list of filenames (i.e. seeded from a pipeline option by listing a directory), and emit those as records
In the following ParDo, read each file manually and decompress it. You will need to use a GCS library to read the GCS file, if you have stored your data there.
This solution will likely not perform as fast as a native source, and it will not be able to handle files that don't fit into memory. But if your files are small, it might work well enough.
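For what it's worth, a rough sketch of option 1 for raw snappy data, assuming the python-snappy package and files small enough to fit in worker memory (the paths are placeholders; the framed/hadoop snappy variants would need the matching stream decompressor instead):

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
import snappy  # python-snappy

class ReadAndDecompress(beam.DoFn):
    def process(self, gcs_path):
        # Read the whole compressed object into memory.
        with FileSystems.open(gcs_path) as f:
            compressed = f.read()
        # Decompress and emit one record per line.
        for line in snappy.uncompress(compressed).decode('utf-8').splitlines():
            yield line

with beam.Pipeline() as p:
    (p
     | 'Files' >> beam.Create(['gs://my-bucket/data/part-00000.snappy'])
     | 'ReadAndDecompress' >> beam.ParDo(ReadAndDecompress())
     | 'Process' >> beam.Map(lambda line: line))  # your transforms here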
Option 2: Add a new decompressor to Beam.
You may be able to contribute a decompressor to Beam. It looks like you would need to implement the decompression logic and provide some constants to specify it when authoring a pipeline.
I think one of the constraints is that it must be possible to scan the file and decompress it a chunk at a time. If the compression format requires reading the whole file into memory, then it will likely not work. This is because the TextIO libraries are designed to be record-based, which supports reading large files that don't fit into memory by breaking them up into small records for processing.

different pipelines based on files in compressed file

I have a compressed file in a google cloud storage bucket. This file contains a big csv file and a small xml based metadata file. I would like to extract both files and determine the metadata and process the csv file. I am using the Python SDK, and the pipeline will run on Google Dataflow at some point.
The current solution is to use Google Cloud Functions to extract both files and start the pipeline with the parameters parsed from the xml file.
I would like to eliminate the Google Cloud Function and process the compressed file in Apache Beam itself. The pipeline should process the XML file and then process the csv file.
However, I am stuck at extracting the two files into separate collections. I would like to understand if my solution is flawed, or if not, an example on how to deal with different files in a single compressed file.
To my understanding, this is not achievable with any of the existing text IOs in Beam.
The problem with your design is that you are enforcing a dependency on file reading order (the metadata XML must be read before processing the CSV file), plus custom logic to interpret the CSV. Neither is supported by any concrete text IO.
If you do want to have this flexibility, I would suggest that you take a look at vcfio. You might want to write your own reader that inherits from filebasedsource.FileBasedSource too. There is some similarity in the implementation of vcfio to your case, in that there is always a header that explains how to interpret the CSV part in a VCF-formatted file.
Actually, if you can somehow rewrite your XML metadata and add it as a header to the CSV file, you can probably use vcfio instead.
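If the archive fits in worker memory, a rough alternative is to handle the whole thing in one DoFn, reading the XML member first and then streaming through the CSV member; the member names and XML field below are purely hypothetical, and for truly huge archives the custom FileBasedSource route above is the better fit:

import io
import xml.etree.ElementTree as ET
import zipfile

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class SplitArchive(beam.DoFn):
    def process(self, archive_path):
        with FileSystems.open(archive_path) as f:
            buf = io.BytesIO(f.read())
        with zipfile.ZipFile(buf) as zf:
            # Read the small metadata file first and use it to interpret the CSV.
            meta = ET.fromstring(zf.read('metadata.xml'))          # hypothetical member name
            delimiter = meta.findtext('delimiter', default=',')    # hypothetical field
            with zf.open('data.csv') as csv_member:                # hypothetical member name
                for raw_line in io.TextIOWrapper(csv_member, encoding='utf-8'):
                    yield raw_line.rstrip('\n').split(delimiter)

with beam.Pipeline() as p:
    rows = (p
            | 'Archive' >> beam.Create(['gs://my-bucket/archive.zip'])
            | 'Split' >> beam.ParDo(SplitArchive()))
    # ... downstream transforms on `rows` ...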
