Apache Beam I/O Transforms - Python

The Apache Beam documentation Authoring I/O Transforms - Overview states:
Reading and writing data in Beam is a parallel task, and using ParDos, GroupByKeys, etc… is usually sufficient. Rarely, you will need the more specialized Source and Sink classes for specific features.
Could someone please provide a very basic example of how to do this in Python?
For example, if I had a local folder containing 100 jpeg images, how would I:
Use ParDos to read/open the files.
Run some arbitrary code on the images (maybe convert them to grey-scale).
Use ParDos to write the modified images to a different local folder.
Thanks,

Here is an example of pipeline https://github.com/apache/beam/blob/fc738ab9ac7fdbc8ac561e580b1a557b919437d0/sdks/python/apache_beam/examples/wordcount.py#L37
In your case, get the names of the files first, then read each file one at a time and write the output.
You might also want to push the file names to a groupby to use the parallelization provided by the runner.
So in total your pipeline might look something like
Read list of filenames -> Send filenames to Shuffle using GroupByKey -> Get one filename at a time in a ParDo -> Read the single file, process it and write it in a ParDo
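A minimal sketch of that pipeline shape for the jpeg example in the question, assuming the Pillow library is installed and the local folder paths are illustrative; beam.Reshuffle() is used here as the idiomatic stand-in for the GroupByKey-based shuffle described above.

import glob
import os

import apache_beam as beam
from PIL import Image


def process_image(path, out_dir='/tmp/grey_images'):
    # Read one image, convert it to greyscale and write it to the output folder.
    img = Image.open(path).convert('L')  # 'L' = 8-bit greyscale
    os.makedirs(out_dir, exist_ok=True)
    img.save(os.path.join(out_dir, os.path.basename(path)))


with beam.Pipeline() as p:
    (p
     | 'ListFiles' >> beam.Create(glob.glob('/tmp/images/*.jpg'))
     | 'Reshuffle' >> beam.Reshuffle()  # spread filenames across workers
     | 'Process' >> beam.Map(process_image))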

Related

How to mark read files in dataflow?

I am using Dataflow to read files from a GCS bucket and do some transformations on them. I am using the beam.io.ReadFromText() method for that.
What is the best way to mark the files that have already been read, so that the same file will not be repeatedly read by Dataflow?
A possible solution is to set up a cloud storage trigger which publishes the name of each file uploaded to the storage bucket as a separate PubSub message to the Topic of your choosing (i.e. projects/PROJECT_ID/topics/TOPIC_NAME).
You can then set up a streaming dataflow pipeline which ingests these PubSub messages via beam.io.ReadFromPubSub(topic='projects/PROJECT_ID/topics/TOPIC_NAME'), from which the filename can be extracted and the data from the file read using beam.io.ReadAllFromText(). You can then continue the pipeline with your own custom transformations.
This pattern negates the need to track the files which have already been transformed as each file is automatically transformed as soon as it is uploaded to the bucket.
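A hedged sketch of that streaming pipeline follows. The topic name is the placeholder from above, and parse_gcs_path assumes the notification payload is the JSON object metadata (with 'bucket' and 'name' fields) that Cloud Storage notifications publish by default, so adjust it to whatever your trigger actually publishes.

import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def parse_gcs_path(message_bytes):
    # Turn one Pub/Sub notification into a gs:// path for ReadAllFromText.
    obj = json.loads(message_bytes.decode('utf-8'))
    return 'gs://{}/{}'.format(obj['bucket'], obj['name'])


options = PipelineOptions(streaming=True)

with beam.Pipeline(options=options) as p:
    (p
     | 'ReadNotifications' >> beam.io.ReadFromPubSub(
           topic='projects/PROJECT_ID/topics/TOPIC_NAME')
     | 'ToGcsPath' >> beam.Map(parse_gcs_path)
     | 'ReadLines' >> beam.io.ReadAllFromText()
     | 'Transform' >> beam.Map(lambda line: line))  # your custom transforms go here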
I came across the following useful link which could assist with the details of implementing the above pattern (see the 'Streaming processing of GCS files' subsection): https://medium.com/@pavankumarkattamuri/input-source-reading-patterns-in-google-cloud-dataflow-4c1aeade6831
Hope this helps!
A Dataflow job using beam.io.ReadFromText will read each file that matches the given pattern exactly once. I assume from your question you're trying to run a pipeline multiple times and only read files that showed up in the GCS bucket since the last run? In that case, you have two options.
(1) Use apache_beam.io.textio.ReadFromTextWithFilename and then record the set of filenames that you already read somewhere (e.g. write them to a text file) that you consult when constructing the set of files to read on your next run, or
(2) Use apache_beam.io.textio.ReadAllFromText to read from a PCollection of filenames, which is computed to be the set of things that exist in your bucket (e.g. using apache_beam.io.fileio.MatchFiles) but were not read in any previous run (recorded as in (1) via a separate output file in GCS).
It might be worth considering if a streaming pipeline would better meet your needs.
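A rough sketch of option (2) above, assuming a hypothetical gs://my-bucket/state/already_read.txt file that lists the paths processed by earlier runs, one per line; the bucket names and file pattern are placeholders.

import apache_beam as beam
from apache_beam.io import fileio


with beam.Pipeline() as p:
    already_read = p | 'ReadState' >> beam.io.ReadFromText(
        'gs://my-bucket/state/already_read.txt')

    new_files = (p
        | 'MatchFiles' >> fileio.MatchFiles('gs://my-bucket/input/*.txt')
        | 'GetPaths' >> beam.Map(lambda metadata: metadata.path)
        | 'DropSeen' >> beam.Filter(
            lambda path, seen: path not in seen,
            seen=beam.pvalue.AsList(already_read)))

    lines = new_files | 'ReadNewFiles' >> beam.io.ReadAllFromText()

    # Record this run's file list (as sharded output) so a later run can fold
    # it into the already-read set.
    _ = new_files | 'WriteState' >> beam.io.WriteToText(
        'gs://my-bucket/state/run', file_name_suffix='.txt')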

Google Dataflow / Apache Beam Python - Side-Input from PCollection kills performance

We are running logfile parsing jobs in Google Dataflow using the Python SDK. The data is spread over several hundred daily log files, which we read via a file pattern from Cloud Storage. The data volume across all files is about 5-8 GB (gz files), with 50-80 million lines in total.
loglines = p | ReadFromText('gs://logfile-location/logs*-20180101')
In addition, we have a simple (small) mapping CSV that maps logfile entries to human-readable text. It has about 400 lines and is about 5 kB in size.
For example, a logfile entry with [param=testing2] should be mapped to "Customer requested 14day free product trial" in the final output.
We do this in a simple beam.Map with a side input, like so:
customerActions = loglines | beam.Map(map_logentries, mappingTable)
where map_logentries is the mapping function and mappingTable is said mapping table.
However, this only works if we read the mapping table in native Python via open() / read(). If we do the same using the Beam pipeline via ReadFromText() and pass the resulting PCollection as a side input to the Map, like so:
mappingTable = p | ReadFromText('gs://side-inputs/category-mapping.csv')
customerActions = loglines | beam.Map(map_logentries, beam.pvalue.AsIter(mappingTable))
performance breaks down completely, to about 2-3 items per second.
Now, my questions:
(1) Why does performance break down so badly? What is wrong with passing a PCollection as a side input?
(2) If PCollections are not recommended as side inputs, how is one supposed to build a pipeline that needs mappings which cannot or should not be hard-coded into the mapping function?
For us, the mapping changes frequently and I need to find a way to let "normal" users provide it. The idea was to have the mapping CSV available in Cloud Storage and simply incorporate it into the pipeline via ReadFromText(). Reading it locally involves providing the mapping to the workers, so only the tech team can do this.
I am aware that there are caching issues with side inputs, but surely this should not apply to a 5 kB input.
All code above is pseudo code to explain the problem. Any ideas and thoughts on this would be highly appreciated!
For more efficient side inputs of small to medium size you can utilize
beam.pvalue.AsList(mappingTable)
since AsList causes Beam to materialize the data, so you can be sure that you will get an in-memory list for that PCollection.
Intended for use in side-argument specification---the same places where AsSingleton and AsIter are used, but forces materialization of this PCollection as a list.
Source: https://beam.apache.org/documentation/sdks/pydoc/2.2.0/apache_beam.pvalue.html?highlight=aslist#apache_beam.pvalue.AsList
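A small sketch of the suggestion above, assuming the CSV has "param,description" rows; map_logentries and the lookup logic are assumptions based on the question, not the asker's real code.

import apache_beam as beam


def map_logentries(line, mapping_rows):
    # mapping_rows arrives as an in-memory list of raw CSV lines via AsList.
    for row in mapping_rows:
        if ',' not in row:
            continue
        param, description = row.split(',', 1)
        if param in line:
            return description
    return line


with beam.Pipeline() as p:
    mappingTable = p | 'ReadMapping' >> beam.io.ReadFromText(
        'gs://side-inputs/category-mapping.csv')
    loglines = p | 'ReadLogs' >> beam.io.ReadFromText(
        'gs://logfile-location/logs*-20180101')

    customerActions = loglines | 'MapEntries' >> beam.Map(
        map_logentries, beam.pvalue.AsList(mappingTable))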
The code looks fine. However, since mappingTable is a mapping, wouldn't beam.pvalue.AsDict be more appropriate for your use case?
Your mappingTable is small enough so side input is a good use case here.
Given that mappingTable is also static, you can load it from GCS in the start_bundle function of your DoFn. See the answer to this post for more details. If mappingTable becomes very large in the future, you can also consider converting your map_logentries and mappingTable into PCollections of key-value pairs and joining them using CoGroupByKey.
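A sketch of the start_bundle approach, using Beam's FileSystems API to read the CSV from GCS on each worker; the path and CSV layout are assumptions taken from the question.

import apache_beam as beam
from apache_beam.io.filesystems import FileSystems


class MapLogEntries(beam.DoFn):
    def __init__(self, mapping_path):
        self._mapping_path = mapping_path
        self._mapping = None

    def start_bundle(self):
        # Load (and cache) the small mapping table once per bundle.
        if self._mapping is None:
            with FileSystems.open(self._mapping_path) as f:
                lines = f.read().decode('utf-8').splitlines()
            self._mapping = dict(
                line.split(',', 1) for line in lines if ',' in line)

    def process(self, line):
        for param, description in self._mapping.items():
            if param in line:
                yield description
                return
        yield line


# Usage inside the pipeline:
# customerActions = loglines | beam.ParDo(
#     MapLogEntries('gs://side-inputs/category-mapping.csv'))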

pyspark MLUtils saveaslibsvm saving only under _temporary and not saving on master

I use PySpark
and MLUtils saveAsLibSVMFile to save an RDD of LabeledPoints.
It works, but it keeps the files on all the worker nodes under /_temporary/ as many part files.
No error is thrown. I would like to save the files in the proper folder, and preferably to combine all the output into one libsvm file that will be located on the nodes or on the master.
Is that possible?
Edit:
No matter what I do, I can't use MLUtils.loadLibSVMFile() to load the libsvm data from the same path I used to save it. Maybe something is wrong with writing the file?
This is normal behavior for Spark. All writing and reading activities are performed in parallel directly from the worker nodes, and data is not passed to or from the driver node.
This is why reading and writing should be performed using storage which can be accessed from each machine, like a distributed file system, an object store or a database. Using Spark with a local file system has very limited applications.
For testing you can use a network file system (it is quite easy to deploy), but it won't work well in production.
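A minimal PySpark sketch of that advice, assuming a SparkContext and an HDFS path that every node can reach (the hdfs:///data/... path is hypothetical); coalesce(1) is only there to produce a single part file and is optional.

from pyspark import SparkContext
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.util import MLUtils

sc = SparkContext(appName="libsvm-save-example")

points = sc.parallelize([
    LabeledPoint(1.0, Vectors.dense([0.1, 0.2])),
    LabeledPoint(0.0, Vectors.dense([0.3, 0.4])),
])

# coalesce(1) forces a single output partition, i.e. a single part file;
# the directory must live on storage visible to every node (HDFS, S3, NFS, ...).
MLUtils.saveAsLibSVMFile(points.coalesce(1), "hdfs:///data/points_libsvm")

# Load it back from the same shared path.
reloaded = MLUtils.loadLibSVMFile(sc, "hdfs:///data/points_libsvm")
print(reloaded.take(2))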

Using TotalOrderPartitioner in Hadoop streaming

I'm using Python with Hadoop streaming to do a project, and I need functionality similar to that provided by the TotalOrderPartitioner and InputSampler in Hadoop; that is, I need to sample the data first and create a partition file, then use the partition file to decide which K-V pair will go to which reducer in the mapper. I need to do it in Hadoop 1.0.4.
I could only find some Hadoop streaming examples with KeyFieldBasedPartitioner and customized partitioners, which use the -partitioner option in the command to tell Hadoop to use these partitioners. The examples I found using TotalOrderPartitioner and InputSampler are all in Java, and they need to use the writePartitionFile() of InputSampler and the DistributedCache class to do the job. So I am wondering whether it is possible to use TotalOrderPartitioner with Hadoop streaming? If it is possible, how should I organize my code to use it? If it is not, is it practical to implement the total order partitioner in Python first and then use it?
I did not try it, but taking the example with KeyFieldBasedPartitioner and simply replacing:
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
with
-partitioner org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner
should work.
One possible way to use TotalOrderPartitioner in Hadoop streaming is to recode a small part of it to get the pathname of its partition file from an environment variable, then compile it, define that environment variable on your systems, and pass its name to the streaming job with the -cmdenv option (documented at https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Streaming_Command_Options).
Source code for TotalOrderPartitioner is available at TotalOrderPartitioner.java. In it, getPartitionFile() is defined on two lines starting on line 143, and its second line shows that if it's not given an argument it uses DEFAULT_PATH as the partition file name. DEFAULT_PATH is defined on line 54 as "_partition.lst", and line 83 has a comment that says it's assumed to be in the DistributedCache. Based on that, without modifying getPartitionFile(), it should be possible to use _partition.lst as the partition filename as long as it's in the DistributedCache.
That leaves the issue of running an InputSampler to write content to the partition file. I think that's best done by running an already coded Java MapReduce job that uses TotalOrderPartitioner, at least to get an example of the output of InputSampler to determine its format. If the example job can be altered to process the type of data you want then you could use it to create a partition file usable for your purposes. A couple of coded MapReduce jobs using TotalOrderPartitioner are TotalOrderSorting.java and TotalSortMapReduce.java.
Alternatively, at twittomatic there is a simple, custom IntervalPartitioner.java in which the partition file pathname is hardcoded as /partitions.lst, and in the sorter directory a script, sample.sh, is supplied that builds partition.lst using Hadoop, a live Twitter feed and sample.py. It should be fairly easy to adapt this system to your needs, starting with replacing the Twitter feed with a sample of your data.

Hadoop: Process image files in Python code

I'm working on a side project where we want to process images in a Hadoop MapReduce program (for eventual deployment to Amazon's Elastic MapReduce). The input to the process will be a list of all the files, each with a little extra data attached (the lat/long position of the bottom left corner - these are aerial photos).
The actual processing needs to take place in Python code so we can leverage the Python Imaging Library (PIL). All the Python streaming examples I can find use stdin and process text input. Can I send image data to Python through stdin? If so, how?
I wrote a Mapper class in Java that takes the list of files and saves the names, the extra data, and the binary contents to a sequence file. I was thinking maybe I need to write a custom Java mapper that takes in the sequence file and pipes it to Python. Is that the right approach? If so, what should the Java that pipes the images out and the Python that reads them in look like?
In case it's not obvious, I'm not terribly familiar with Java OR Python, so it's also possible I'm just biting off way more than I can chew with this as my introduction to both languages...
There are a few possible approaches that I can see:
Use both the extra data and the file contents as input to your Python program. The tricky part here will be the encoding. I frankly have no idea how streaming works with raw binary content, and I'm assuming that the basic answer is "not well." The main issue is that the stdin/stdout communication between processes is very text-based, relying on delimiting input with tabs and newlines, and things like that. You would need to worry about the encoding of the image data, and probably have some sort of pre-processing step, or a custom InputFormat, so that you could represent the image as text.
Use only the extra data and the file location as input to your Python program. Then the program can independently read the actual image data from the file (a rough sketch of such a mapper follows this list). The hiccup here is making sure that the file is available to the Python script. Remember this is a distributed environment, so the files would have to be in HDFS or somewhere similar, and I don't know if there are good libraries for reading files from HDFS in Python.
Do the Java-Python interaction yourself. Write a Java mapper that uses the Runtime class to start the Python process itself. This way you get full control over exactly how the two worlds communicate, but obviously it's more code and a bit more involved.
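If you go with the second approach, a rough sketch of the Python streaming mapper might look like the following. The tab-separated "path<TAB>lat,long" input format and the output folder are assumptions, and the image files must sit at paths the worker can actually read (local disk, a mounted filesystem, etc.).

#!/usr/bin/env python
import os
import sys

from PIL import Image


def main():
    out_dir = '/tmp/grey_images'
    os.makedirs(out_dir, exist_ok=True)
    for line in sys.stdin:
        # Each input record: "<image path>\t<lat,long of bottom-left corner>"
        path, latlong = line.rstrip('\n').split('\t', 1)
        grey = Image.open(path).convert('L')  # convert to 8-bit greyscale
        out_path = os.path.join(out_dir, os.path.basename(path))
        grey.save(out_path)
        # Emit a key/value record so downstream steps know what was produced.
        print('{}\t{}'.format(out_path, latlong))


if __name__ == '__main__':
    main()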
