I am using Dataflow to read files from a GCS bucket and do some transformations on them. I am using the beam.io.ReadFromText() method for that.
What is the best way to mark the files that have already been read, so that the same file is not read repeatedly by Dataflow?
A possible solution is to set up a cloud storage trigger which publishes the name of each file uploaded to the storage bucket as a separate PubSub message to the Topic of your choosing (i.e. projects/PROJECT_ID/topics/TOPIC_NAME).
You can then set up a streaming dataflow pipeline which ingests these PubSub messages via beam.io.ReadFromPubSub(topic='projects/PROJECT_ID/topics/TOPIC_NAME'), from which the filename can be extracted and the data from the file read using beam.io.ReadAllFromText(). You can then continue the pipeline with your own custom transformations.
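A minimal sketch of that streaming pipeline, assuming each Pub/Sub message body is the full gs:// path of the uploaded file (adjust the decoding step if your trigger publishes something else):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    lines = (
        p
        | 'ReadFileNames' >> beam.io.ReadFromPubSub(
            topic='projects/PROJECT_ID/topics/TOPIC_NAME')
        | 'DecodePaths' >> beam.Map(lambda msg: msg.decode('utf-8'))
        | 'ReadFileContents' >> beam.io.ReadAllFromText())
    # ... continue with your own transformations on `lines` ...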
This pattern removes the need to track which files have already been transformed, as each file is automatically processed as soon as it is uploaded to the bucket.
I came across the following useful link, which could assist with the details of implementing the above pattern (see the 'Streaming processing of GCS files' subsection): https://medium.com/@pavankumarkattamuri/input-source-reading-patterns-in-google-cloud-dataflow-4c1aeade6831
Hope this helps!
A Dataflow job using beam.io.ReadFromText will read each file that matches the given pattern exactly once. I assume from your question you're trying to run a pipeline multiple times and only read files that showed up in the GCS bucket since the last run? In that case, you have two options.
(1) Use apache_beam.io.textio.ReadFromTextWithFilename and record somewhere (e.g. in a text file) the set of filenames you have already read, then consult that record when constructing the set of files to read on your next run, or
(2) Use apache_beam.io.textio.ReadAllFromText to read from a PCollection of filenames, which is computed to be the set of things that exist in your bucket (e.g. using apache_beam.io.fileio.MatchFiles) but were not read in any previous run (recorded as in (1) via a separate output file in GCS).
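A rough sketch of option (2), under a few assumptions: a hypothetical "ledger" of already-processed filenames is kept in GCS, and all paths below are placeholders.

import time
import apache_beam as beam
from apache_beam.io import fileio

run_id = time.strftime('%Y%m%d%H%M%S')  # give each run its own ledger file

with beam.Pipeline() as p:
    # Filenames recorded by previous runs, one path per line.
    # validate=False tolerates an empty ledger on the very first run.
    already_read = p | 'ReadLedger' >> beam.io.ReadFromText(
        'gs://my-bucket/ledger/processed-*.txt', validate=False)

    # Everything currently in the bucket, minus what was read before.
    new_files = (
        p
        | 'MatchFiles' >> fileio.MatchFiles('gs://my-bucket/input/*.txt')
        | 'ExtractPaths' >> beam.Map(lambda metadata: metadata.path)
        | 'DropSeen' >> beam.Filter(
            lambda path, seen: path not in seen,
            seen=beam.pvalue.AsList(already_read)))

    # Read only the new files, then continue with your own transformations.
    lines = new_files | 'ReadNewFiles' >> beam.io.ReadAllFromText()

    # Record this run's filenames so the next run can skip them.
    _ = new_files | 'WriteLedger' >> beam.io.WriteToText(
        'gs://my-bucket/ledger/processed-' + run_id, file_name_suffix='.txt')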
It might be worth considering if a streaming pipeline would better meet your needs.
I'm trying to upload a set of pd.DataFrames as CSV to a folder in Dropbox using the Dropbox Python SDK (v2). The set of files is not particularly big, but it's numerous. Using batches will help to reduce the API calls and comply with the developer recommendations outlined in the documentation:
"The idea is to group concurrent file uploads into batches, where files
in each batch are uploaded in parallel via multiple API requests to
maximize throughput, but the whole batch is committed in a single,
asynchronous API call to allow Dropbox to coordinate the acquisition
and release of namespace locks for all files in the batch as
efficiently as possible."
Following several answers on SO (see the one most relevant to my problem here) and this answer from the SDK maintainers in the Dropbox Forum, I tried the following code:
commit_info = []
for df in list_pandas_df:
    df_raw_str = df.to_csv(index=False)
    upload_session = dbx.upload_session_start(df_raw_str.encode())
    commit_info.append(
        dbx.files.CommitInfo(path='/path/to/db/folder.csv')
    )
dbx.files_upload_finish_batch(commit_info)
Nonetheless, when reading the files_upload_finish_batch docstring I noticed that the function only takes a list of CommitInfo as an argument (documentation), which is confusing since the non-batch version (files_upload_session_finish) does take a CommitInfo object with a path, and a cursor object with data about the session.
I'm fairly lost in the documentation, and even the source code is not much help in understanding how batching works for uploading several files (as opposed to chunked uploads of heavy files). What am I missing here?
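For reference, this is how I currently understand the batch flow is supposed to be wired together from the docs (which may well be wrong); the token and target paths are placeholders:

import dropbox

dbx = dropbox.Dropbox('ACCESS_TOKEN')  # placeholder token

finish_args = []
for i, df in enumerate(list_pandas_df):
    data = df.to_csv(index=False).encode()
    # close=True marks the session complete, since each file fits in one call.
    session = dbx.files_upload_session_start(data, close=True)
    cursor = dropbox.files.UploadSessionCursor(
        session_id=session.session_id, offset=len(data))
    commit = dropbox.files.CommitInfo(path='/path/to/db/folder/file_{}.csv'.format(i))
    # The batch finish call seems to take UploadSessionFinishArg (cursor + commit),
    # not bare CommitInfo objects.
    finish_args.append(
        dropbox.files.UploadSessionFinishArg(cursor=cursor, commit=commit))

# One asynchronous call commits the whole batch; poll the returned job with
# files_upload_session_finish_batch_check if the per-file results are needed.
launch = dbx.files_upload_session_finish_batch(finish_args)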
I have a very large CSV file (let's say 1TB) that I need to get from GCS into BQ. While BQ does have a CSV loader, the CSV files that I have are pretty non-standard and don't end up loading properly into BQ without being reformatted first.
Normally I would download the CSV file onto a server to 'process it' and save it either directly to BQ or to an Avro file that can be ingested easily by BQ. However, the file(s) are quite large, and it's quite possible (and probable) that I wouldn't have the storage/memory to do the batch processing without writing a lot of code to optimize/stream it.
Is this a good use case for Cloud Dataflow? Are there any tutorials or ways to go about getting a file of format "X" from GCS into BQ? Any tutorial pointers or example scripts would be great.
I'd personally use Dataflow (not Dataprep) and write a simple pipeline to read the file in parallel, clean/transform it, and finally write it to BigQuery. It's pretty straightforward. Here's an example of one in my GitHub repo. Although it's in Java, you could easily port it to Python. Note: it uses the "templates" feature in Dataflow, but this can be changed with one line of code.
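A rough Python sketch of that pipeline shape (the project, bucket, table, schema, and cleaning logic below are placeholders, not taken from the linked repo):

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def clean_row(line):
    # Whatever your non-standard CSV needs: fix delimiters, strip quotes, etc.
    fields = line.split(',')
    return {'id': int(fields[0]), 'name': fields[1].strip()}

options = PipelineOptions(
    runner='DataflowRunner',
    project='my-project',
    region='us-central1',
    temp_location='gs://my-bucket/tmp')

with beam.Pipeline(options=options) as p:
    (p
     | 'Read' >> beam.io.ReadFromText('gs://my-bucket/input/*.csv')
     | 'CleanAndParse' >> beam.Map(clean_row)
     | 'WriteToBQ' >> beam.io.WriteToBigQuery(
         'my-project:my_dataset.my_table',
         schema='id:INTEGER,name:STRING',
         write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
         create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))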
If Dataflow is off the table, another option could be to use a weird/unused delimiter and read the entire row into BigQuery. Then use SQL/Regex/UDFs to clean/transform/parse it. See here (suggestion from Felipe). We've done this lots of times in the past, and because you're in BigQuery it scales really well.
I would consider using Cloud Dataprep.
Dataprep can import data from GCS, clean / modify the data and export to BigQuery. One of the features that I like is that everything can be done visually / interactively so that I can see how the data transforms.
Start with a subset of your data to see what transformations are required and to give yourself some practice before loading and processing a TB of data.
You can always transfer from a storage bucket directly into a BQ table:
bq --location=US load --[no]replace --source_format=CSV dataset.table gs://bucket/file.csv [schema]
Here, [schema] can be an inline schema of your csv file (like id:int,name:string,..) or a path to a JSON schema file (available locally).
As per BQ documentation, they try to parallelize large CSV loads into tables. Of course, there is an upper-bound involved: maximum size of an uncompressed (csv) file to be loaded from GCS to BQ should be <= 5TB, which is way above your requirements. I think you should be good with this.
The Apache Beam documentation Authoring I/O Transforms - Overview states:
Reading and writing data in Beam is a parallel task, and using ParDos, GroupByKeys, etc… is usually sufficient. Rarely, you will need the more specialized Source and Sink classes for specific features.
Could someone please provide a very basic example of how to do this in Python?
For example, if I had a local folder containing 100 jpeg images, how would I:
Use ParDos to read/open the files.
Run some arbitrary code on the images (maybe convert them to grey-scale).
Use ParDos to write the modified images to a different local folder.
Thanks,
Here is an example of a pipeline: https://github.com/apache/beam/blob/fc738ab9ac7fdbc8ac561e580b1a557b919437d0/sdks/python/apache_beam/examples/wordcount.py#L37
In your case, get the names of the files first, then read each file one at a time and write the output.
You might also want to push the filenames through a GroupByKey so that the runner can parallelize the reads.
So in total your pipeline might look something like:
Read list of filenames -> Send filenames through a shuffle using GroupByKey -> Get one filename at a time in a ParDo -> Read the single file, process it, and write it in a ParDo
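A rough sketch of that shape for the 100-jpeg example; the local paths and the Pillow-based grayscale step are my own choices, not something Beam prescribes:

import glob
import os
import apache_beam as beam

class ProcessImages(beam.DoFn):
    def process(self, element):
        # element is a (key, iterable of filenames) pair coming out of GroupByKey
        from PIL import Image  # imported here so it is available on workers
        _, filenames = element
        for path in filenames:
            img = Image.open(path).convert('L')  # read and convert to grayscale
            out_path = os.path.join('/tmp/output', os.path.basename(path))
            img.save(out_path)                   # write to the other local folder
            yield out_path

os.makedirs('/tmp/output', exist_ok=True)

with beam.Pipeline() as p:
    (p
     | 'ListFiles' >> beam.Create(glob.glob('/tmp/input/*.jpg'))
     | 'KeyByName' >> beam.Map(lambda path: (path, path))
     | 'Shuffle' >> beam.GroupByKey()  # lets the runner spread the reads out
     | 'ReadProcessWrite' >> beam.ParDo(ProcessImages()))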
I have a compressed file in a Google Cloud Storage bucket. This file contains a big CSV file and a small XML-based metadata file. I would like to extract both files, determine the metadata, and process the CSV file. I am using the Python SDK, and the pipeline will run on Google Dataflow at some point.
The current solution is to use Google Cloud Functions to extract both files and start the pipeline with the parameters parsed from the XML file.
I would like to eliminate the Google Cloud Function and process the compressed file in Apache Beam itself. The pipeline should process the XML file first and then process the CSV file.
However, I am stuck at extracting the two files into separate collections. I would like to understand whether my approach is flawed and, if not, to see an example of how to deal with different files within a single compressed file.
In my understanding, this is not achievable through any existing text IO in Beam.
The problem with your design is that you are enforcing both a file-reading order (the metadata XML must be read before the CSV is processed) and custom logic to interpret the CSV. Neither is supported by any concrete text IO.
If you do want to have this flexibility, I would suggest that you take a look at vcfio. You might want to write your own reader that inherits from filebasedsource.FileBasedSource too. There is some similarity in the implementation of vcfio to your case, in that there is always a header that explains how to interpret the CSV part in a VCF-formatted file.
Actually, if you can somehow rewrite your XML metadata and add it as a header to the CSV file, you could probably use vcfio instead.
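If a full custom FileBasedSource feels like too much, a simpler (though less parallel) sketch of the same idea is a plain DoFn that opens the archive itself; the zip layout, member names, and the metadata field below are all assumptions:

import csv
import io
import zipfile
import xml.etree.ElementTree as ET

import apache_beam as beam
from apache_beam.io.filesystem import CompressionTypes
from apache_beam.io.filesystems import FileSystems

class ExtractAndParse(beam.DoFn):
    def process(self, archive_path):
        # Read the raw archive bytes (no automatic decompression).
        with FileSystems.open(
                archive_path,
                compression_type=CompressionTypes.UNCOMPRESSED) as raw:
            data = io.BytesIO(raw.read())
        with zipfile.ZipFile(data) as zf:
            xml_name = next(n for n in zf.namelist() if n.endswith('.xml'))
            csv_name = next(n for n in zf.namelist() if n.endswith('.csv'))
            # Hypothetical metadata field describing the CSV delimiter.
            metadata = ET.fromstring(zf.read(xml_name))
            delimiter = metadata.findtext('delimiter', default=',')
            reader = csv.reader(
                io.TextIOWrapper(zf.open(csv_name), encoding='utf-8'),
                delimiter=delimiter)
            for row in reader:
                yield row

with beam.Pipeline() as p:
    (p
     | 'ArchivePath' >> beam.Create(['gs://my-bucket/archive.zip'])
     | 'ExtractAndParse' >> beam.ParDo(ExtractAndParse())
     | 'Print' >> beam.Map(print))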
I am building a client-server application using python and javascript.
On the frontend, I'm recording audio using recorder.js.
After some fixed interval, I use exportWav() and send the audio file to the server.
On the backend I now need to concatenate these files to reconstruct the bigger audio file.
I saw this question, but I don't have actual .wav files, just the blobs returned by exportWav.
I'm also using App Engine, so I cannot write the output to a wav file. I need to create another audio blob that I can store in the datastore.
Any ideas?
Is each segment the complete binary data for a WAV file? If so, you'll need to use some kind of format-aware library to concatenate the WAVs. The implementation you choose is up to you, but of course it will need to be in Python. Alternatively, you could use a Compute Engine instance to run a binary that concatenates the WAVs, using the Cloud Storage client library to put the resulting WAV files in the bucket and cleaning up any temporary files afterward.
If they're just segments of a single WAV's binary data, you can simply transfer the data and use the Cloud Storage client library to open the relevant Cloud Storage blob for writing, appending the new portion to the end of the "file".
It really comes down to the fact that you yourself need to understand what's being returned by exportWav.
If you're set on using blob properties in Datastore, you can of course do this; just look up the relevant documentation for storing blobs in Datastore, and be aware that you can't "update" objects or "concatenate" to their properties. If you put a WAV today and want to concatenate to it in three months, you'll need to fetch the full entity and blob, delete it, concatenate the new portion in memory, and then put it back.
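Coming back to the first case (each blob is a complete WAV), here is a rough sketch using only Python's standard wave module; it concatenates entirely in memory and assumes every segment shares the same sample rate, sample width, and channel count:

import io
import wave

def concat_wav_blobs(wav_blobs):
    params = None
    frames = []
    for blob in wav_blobs:
        with wave.open(io.BytesIO(blob), 'rb') as segment:
            if params is None:
                params = segment.getparams()
            frames.append(segment.readframes(segment.getnframes()))

    out = io.BytesIO()
    with wave.open(out, 'wb') as combined:
        combined.setparams(params)  # nframes is fixed up when the writer closes
        for chunk in frames:
            combined.writeframes(chunk)
    return out.getvalue()  # bytes you can store as a new blob in the datastore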