Read Snappy or LZO compressed files from Dataflow in Python

Is there a way to read Snappy or LZO compressed files on Dataflow using Apache Beam's Python SDK?
Since I couldn't find an easier way, this is my current approach (which seems totally overkill and inefficient):
Start a Dataproc cluster
Uncompress the data using Hive on the new cluster and place it in a temporary location
Stop the Dataproc cluster
Run the Dataflow job that reads the temporary uncompressed data
Clean up the temporary uncompressed data

I don't think there is any built-in way to do this today with Beam. The Python SDK supports gzip, bzip2, and deflate.
Option 1: Read in the whole files and decompress manually
Create a custom source to produce a list of filenames (e.g., seeded from a pipeline option by listing a directory), and emit those as records
In the following ParDo, read each file manually and decompress it. You will need to use a GCS client library to read the file if you have stored your data there.
This solution will likely not perform as well, and because each whole file has to be loaded into memory it will not work for large files. But if your files are small, it might work well enough (a sketch follows below).
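A minimal sketch of Option 1, assuming the files are whole-file (raw) Snappy data and the python-snappy package is installed; the bucket path, glob pattern, and newline-delimited record format are placeholders, and Hadoop-framed Snappy would need a different decompression routine:

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems
import snappy  # pip install python-snappy

class ReadAndDecompress(beam.DoFn):
    def process(self, file_name):
        # FileSystems resolves gs:// paths when the GCP extras are installed.
        with FileSystems.open(file_name) as f:
            data = f.read()  # the whole file must fit in memory
        text = snappy.uncompress(data).decode('utf-8')
        for line in text.splitlines():  # emit one record per line
            yield line

def list_files(pattern):
    # Expand a glob pattern into the list of matching file names.
    return [m.path for m in FileSystems.match([pattern])[0].metadata_list]

with beam.Pipeline() as p:
    lines = (
        p
        | 'ListFiles' >> beam.Create(list_files('gs://my-bucket/data/*.snappy'))
        | 'ReadAndDecompress' >> beam.ParDo(ReadAndDecompress())
    )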
Option 2: Add a new decompressor to Beam.
You may be able to contribute a decompressor to Beam. It looks like you would need to implement the decompressor logic and provide some constants to specify it when authoring a pipeline.
I think one of the constraints is that it must be possible to scan the file and decompress it in chunks. If the compression format requires reading the whole file into memory, it will likely not work, because the TextIO libraries are designed to be record-based: they support reading large files that don't fit into memory by breaking them up into small records for processing.

Related

Read columns from Parquet in GCS without reading the entire file?

Reading a Parquet file from disk, I can choose to read only a few columns (I assume it scans the header/footer, then decides). Is it possible to do this remotely (such as via Google Cloud Storage)?
We have 100 MB Parquet files with about 400 columns, and we have a use case where we want to read 3 of them and show them to the user. The user can choose which columns.
Currently we download the entire file and then filter it, but this takes time.
Long term we will be putting it into Google BigQuery and the problem will be solved.
More specifically we use Python with either pandas or PyArrow and ideally would like to use those (either with a GCS backend or manually getting the specific data we need via a wrapper). This runs in Cloud Run so we would prefer to not use Fuse, although that is certainly possible.
I intend to use Python and pandas/pyarrow as the backend for this, running in Cloud Run (hence why data size matters: a 100 MB download to disk actually means 100 MB downloaded to RAM).
We use pyarrow.parquet.read_table with to_pandas(), or pandas.read_parquet.
The pandas.read_parquet function has a columns argument for reading only a subset of columns.
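A hedged sketch of reading just the needed columns straight from GCS, assuming the gcsfs package is installed so pandas/pyarrow can resolve gs:// paths; the bucket, object, and column names are placeholders. With a column selection, only the footer metadata and the requested column chunks should be fetched rather than the whole object:

import pandas as pd

# Read three columns of a ~400-column Parquet file directly from GCS.
df = pd.read_parquet(
    'gs://my-bucket/data/file.parquet',
    columns=['col_a', 'col_b', 'col_c'],
)

If you prefer to stay at the Arrow level, pyarrow.parquet.read_table accepts the same columns argument and a file-like object opened via gcsfs.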

Reading a .csv.gz file using Dask

I am trying to load a .csv.gz file into Dask.
Reading it this way loads it successfully, but into one partition only:
dd.read_csv(fp, compression="gzip")
My workaround now is to unzip the file using gzip, load it into Dask, then remove it after I am finished. Is there a better way?
There is a very similar question here:
The fundamental reason is that formats like bz2, gz, or zip do not allow random access; the only way to read the data is from the start.
The recommendation there is:
The easiest solution is certainly to stream your large files into several compressed files each (remember to end each file on a newline!), and then load those with Dask as you suggest. Each smaller file will become one dataframe partition in memory, so as long as the files are small enough, you will not run out of memory as you process the data with Dask.
As an alternative option, if disk space is a consideration, you can use .to_parquet instead of .to_csv upstream, since Parquet data is compressed by default.
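A minimal sketch of both workarounds, with placeholder paths; repartitioning assumes the decompressed data fits in worker memory:

import dask.dataframe as dd

# gzip is not splittable, so the file is read as a single partition (blocksize=None).
df = dd.read_csv('data/big.csv.gz', compression='gzip', blocksize=None)

# Split it into smaller in-memory partitions after the fact ...
df = df.repartition(npartitions=16)

# ... or write it out once as Parquet (needs pyarrow or fastparquet),
# which is compressed and can be read back in parallel.
df.to_parquet('data/big_parquet/')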

Get a massive csv file from GCS to BQ

I have a very large CSV file (let's say 1 TB) that I need to get from GCS into BQ. While BQ does have a CSV loader, the CSV files that I have are pretty non-standard and don't load properly into BQ without reformatting.
Normally I would download the CSV file onto a server to 'process it' and save it either directly to BQ or to an Avro file that can be ingested easily by BQ. However, the file(s) are quite large and it's quite possible (and probable) that I wouldn't have the storage/memory to do the batch processing without writing a lot of code to optimize/stream it.
Is this a good use case for Cloud Dataflow? Are there any tutorials or ways to go about getting a file of format "X" from GCS into BQ? Any tutorial pointers or example scripts would be great.
I'd personally use Dataflow (not Dataprep) and write a simple pipeline to read the file in parallel, clean/transform it, and finally write it to BigQuery. It's pretty straightforward. Here's an example of one in my GitHub repo. Although it's in Java, you could easily port it to Python. Note: it uses the "templates" feature in Dataflow, but this can be changed with one line of code.
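A hedged Python sketch of that kind of pipeline (not the linked Java template, just an illustration of the same idea): read raw lines from GCS, clean them in a transform, and write to BigQuery. The project, bucket, table, schema, and parse_line logic are all placeholders:

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def parse_line(line):
    # Placeholder for whatever cleaning the non-standard CSV needs.
    fields = line.split(',')
    return {'id': fields[0], 'name': fields[1]}

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp',
)

with beam.Pipeline(options=options) as p:
    (
        p
        | 'ReadFromGCS' >> beam.io.ReadFromText('gs://my-bucket/huge.csv')
        | 'CleanAndParse' >> beam.Map(parse_line)
        | 'WriteToBQ' >> beam.io.WriteToBigQuery(
            'my-project:my_dataset.my_table',
            schema='id:STRING,name:STRING',
            create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        )
    )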
If Dataflow is off the table, another option could be to use a weird/unused delimiter and read the entire row into BigQuery. Then use SQL/Regex/UDFs to clean/transform/parse it. See here (suggestion from Felipe). We've done this lots of times in the past, and because you're in BigQuery it scales really well.
I would consider using Cloud Dataprep.
Dataprep can import data from GCS, clean / modify the data and export to BigQuery. One of the features that I like is that everything can be done visually / interactively so that I can see how the data transforms.
Start with a subset of your data to see what transformations are required and to give yourself some practice before loading and processing a TB of data.
You can always transfer from a storage bucket directly into a BQ table:
bq --location=US load --[no]replace --source_format=CSV dataset.table gs://bucket/file.csv [schema]
Here, [schema] can be an inline schema of your csv file (like id:int,name:string,..) or a path to a JSON schema file (available locally).
As per the BQ documentation, large CSV loads into tables are parallelized. Of course, there is an upper bound involved: the maximum size of an uncompressed CSV file loaded from GCS to BQ is 5 TB, which is way above your requirement. I think you should be good with this.

different pipelines based on files in compressed file

I have a compressed file in a Google Cloud Storage bucket. This file contains a big CSV file and a small XML-based metadata file. I would like to extract both files, determine the metadata, and process the CSV file. I am using the Python SDK, and the pipeline will run on Google Dataflow at some point.
The current solution is to use Google Cloud Functions to extract both files and start the pipeline with the parameters parsed from the XML file.
I would like to eliminate the Cloud Function and process the compressed file in Apache Beam itself. The pipeline should process the XML file first and then process the CSV file.
However, I am stuck at extracting the two files into separate collections. I would like to understand if my solution is flawed, or if not, an example on how to deal with different files in a single compressed file.
In my understanding, this is not achievable through any existing text IO in Beam.
The problem with your design is that you are enforcing both a dependency on file reading order (the metadata XML must be read before the CSV file is processed) and custom logic to interpret the CSV. Neither is supported by any concrete text IO.
If you do want to have this flexibility, I would suggest that you take a look at vcfio. You might want to write your own reader that inherits from filebasedsource.FileBasedSource too. There is some similarity in the implementation of vcfio to your case, in that there is always a header that explains how to interpret the CSV part in a VCF-formatted file.
Actually, if you can somehow rewrite your XML metadata and add it as a header to the CSV file, you can probably use vcfio instead.
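As a rough alternative that sidesteps a custom source entirely, the archive could be handled inside a single DoFn: open the zip from GCS, parse the XML metadata first, then emit the CSV rows. This is only a sketch; the archive layout, the file names, and the hypothetical <delimiter> metadata field are assumptions:

import csv
import io
import xml.etree.ElementTree as ET
import zipfile

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

class ExtractArchive(beam.DoFn):
    def process(self, archive_path):
        # Load the whole archive into memory; fine for a sketch, not for huge files.
        with FileSystems.open(archive_path) as f:
            buf = io.BytesIO(f.read())
        with zipfile.ZipFile(buf) as zf:
            names = zf.namelist()
            xml_name = next(n for n in names if n.endswith('.xml'))
            csv_name = next(n for n in names if n.endswith('.csv'))
            # Read the metadata first and derive whatever the CSV logic needs,
            # e.g. a hypothetical <delimiter> element.
            meta = ET.fromstring(zf.read(xml_name))
            delimiter = meta.findtext('delimiter', default=',')
            with zf.open(csv_name) as csv_file:
                reader = csv.reader(io.TextIOWrapper(csv_file, 'utf-8'),
                                    delimiter=delimiter)
                for row in reader:
                    yield row

with beam.Pipeline() as p:
    rows = (
        p
        | beam.Create(['gs://my-bucket/archive.zip'])
        | beam.ParDo(ExtractArchive())
    )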

Access HDF files stored on s3 in pandas

I'm storing pandas data frames dumped in HDF format on S3. I'm pretty much stuck as I can't pass the file pointer, the URL, the s3 URL or a StringIO object to read_hdf. If I understand it correctly the file must be present on the filesystem.
Source: https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L315
It looks like it's implemented for CSV but not for HDF. Is there any better way to open those HDF files than copy them to the filesystem?
For the record, these HDF files are being handled on a web server, that's why I don't want a local copy.
If I need to stick with the local file: Is there any way to emulate that file on the filesystem (with a real path) which can be destroyed after the reading is done?
I'm using Python 2.7 with Django 1.9 and pandas 0.18.1.
Newer versions of pandas allow reading an HDF5 file directly from S3, as mentioned in the read_hdf documentation. Perhaps you should upgrade pandas if you can. This of course assumes you've set the right access rights to read those files: either with a credentials file or with public ACLs.
Regarding your last comment, I am not sure why storing several HDF5 files per DataFrame would necessarily be an argument against using HDF5. Pickle should be much slower than HDF5, though joblib.dump might partially improve on this.
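If you are stuck with the local-file route on an older pandas, a minimal sketch of the temporary-file approach asked about above; the bucket, key, and HDF key names are placeholders, and it assumes boto3 and PyTables are installed:

import tempfile

import boto3
import pandas as pd

s3 = boto3.client('s3')

# Download the object into a named temporary file, read it, and let the file
# be deleted when the context manager exits (assumes a POSIX filesystem).
with tempfile.NamedTemporaryFile(suffix='.h5') as tmp:
    s3.download_fileobj('my-bucket', 'frames/my_frame.h5', tmp)
    tmp.flush()
    df = pd.read_hdf(tmp.name, key='df')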
