I'm using Python with Hadoop Streaming for a project, and I need functionality similar to what the TotalOrderPartitioner and InputSampler provide in Hadoop: sample the data first and create a partition file, then use that partition file on the map side to decide which reducer each key/value pair goes to. I need to do this in Hadoop 1.0.4.
I could only find Hadoop Streaming examples with KeyFieldBasedPartitioner and custom partitioners, which use the -partitioner option on the command line to tell Hadoop which partitioner to use. The examples I found using TotalOrderPartitioner and InputSampler are all in Java, and they rely on InputSampler's writePartitionFile() and on the DistributedCache class to do the job. So I am wondering: is it possible to use TotalOrderPartitioner with Hadoop Streaming? If it is, how should I organize my code to use it? If it is not, is it practical to implement a total-order partitioner in Python myself and use that instead?
I did not try it, but taking the KeyFieldBasedPartitioner example and simply replacing:
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner
with
-partitioner org.apache.hadoop.mapred.lib.TotalOrderPartitioner
should work. (In Hadoop 1.0.4 the class lives in the old org.apache.hadoop.mapred.lib package; the org.apache.hadoop.mapreduce.lib.partition version only appears in later releases.)
One possible way to use TotalOrderPartitioner in Hadoop Streaming is to recode a small part of it so that it gets the pathname of its partition file from an environment variable, compile it, define that environment variable on your systems, and pass its name to the streaming job with the -cmdenv option (documented at https://hadoop.apache.org/docs/stable/hadoop-streaming/HadoopStreaming.html#Streaming_Command_Options).
Source code for TotalOrderPartitioner is available in TotalOrderPartitioner.java. In it, getPartitionFile() is defined on two lines starting at line 143, and its second line shows that if it's not given an argument it uses DEFAULT_PATH as the partition file name. DEFAULT_PATH is defined on line 54 as "_partition.lst", and line 83 has a comment saying it's assumed to be in the DistributedCache. Based on that, without modifying getPartitionFile(), it should be possible to use _partition.lst as the partition filename as long as it's in the DistributedCache (for a streaming job, the generic -files option is one way to put it there).
That leaves the issue of running an InputSampler to write content to the partition file. I think that's best done by running an existing Java MapReduce job that uses TotalOrderPartitioner, at least to get an example of InputSampler's output and determine its format. If the example job can be altered to process the type of data you want, you could then use it to create a partition file usable for your purposes. A couple of ready-made MapReduce jobs using TotalOrderPartitioner are TotalOrderSorting.java and TotalSortMapReduce.java.
Alternatively, the twittomatic project has a simple, custom IntervalPartitioner.java in which the partition file pathname is hardcoded as /partitions.lst, and its sorter directory supplies a script, sample.sh, that builds partitions.lst using hadoop, a live Twitter feed and sample.py. It should be fairly easy to adapt this system to your needs, starting by replacing the Twitter feed with a sample of your own data.
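For the sampling step itself, the sample.py idea boils down to picking evenly spaced boundary keys from a sorted sample of your data. A rough sketch of such a sampler, assuming tab-separated records on stdin and a plain-text, one-boundary-key-per-line partition file (that text format only fits a custom partitioner like the IntervalPartitioner above; TotalOrderPartitioner itself expects a SequenceFile written by InputSampler):
#!/usr/bin/env python
# Rough sampler sketch: read tab-separated records from stdin, sort the keys,
# and write NUM_REDUCERS - 1 evenly spaced boundary keys to partitions.lst,
# one per line. The plain-text format is an assumption; check what the
# partitioner you end up using actually expects.
import sys

NUM_REDUCERS = 4  # hypothetical reducer count

keys = sorted(line.split("\t", 1)[0] for line in sys.stdin)
step = max(1, len(keys) // NUM_REDUCERS)

with open("partitions.lst", "w") as out:
    for i in range(1, NUM_REDUCERS):
        if i * step < len(keys):
            out.write(keys[i * step] + "\n")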
The Apache Beam documentation Authoring I/O Transforms - Overview states:
Reading and writing data in Beam is a parallel task, and using ParDos, GroupByKeys, etc… is usually sufficient. Rarely, you will need the more specialized Source and Sink classes for specific features.
Could someone please provide a very basic example of how to do this in Python?
For example, if I had a local folder containing 100 jpeg images, how would I:
Use ParDos to read/open the files.
Run some arbitrary code on the images (maybe convert them to grey-scale).
Use ParDos to write the modified images to a different local folder.
Thanks,
Here is an example of a pipeline: https://github.com/apache/beam/blob/fc738ab9ac7fdbc8ac561e580b1a557b919437d0/sdks/python/apache_beam/examples/wordcount.py#L37
In your case, get the file names first, then read each file one at a time and write the output.
You might also want to push the file names through a GroupByKey to take advantage of the parallelization provided by the runner.
So in total, your pipeline might look something like:
Read list of filenames -> send filenames through a shuffle using GroupByKey -> get one filename at a time in a ParDo -> read the single file, process it and write it in a ParDo
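A minimal sketch of that shape, run locally with the DirectRunner: the input/output paths are placeholders, Pillow is assumed to be available for the grey-scale step, and beam.Reshuffle() stands in for the manual GroupByKey-based shuffle described above.
import glob
import os

import apache_beam as beam
from PIL import Image  # Pillow, assumed to be installed


def to_greyscale(path, out_dir):
    # Open one image, convert it to grey-scale, and write it to the output folder.
    os.makedirs(out_dir, exist_ok=True)
    img = Image.open(path).convert("L")
    img.save(os.path.join(out_dir, os.path.basename(path)))
    return path


with beam.Pipeline() as p:  # DirectRunner by default
    (p
     | "ListFiles" >> beam.Create(glob.glob("/path/to/input/*.jpg"))
     | "Shuffle" >> beam.Reshuffle()  # redistribute the filenames across workers
     | "Process" >> beam.Map(to_greyscale, out_dir="/path/to/output"))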
I have multiple Python scripts. Each script depends on another, i.e. the first script uses the output of the second script, the second script uses the output of the third, and so on. Is there any way I can link up the scripts so that I can automate the whole process? I came across the Talend Data Integration tool but I can't figure out how to use it. Any reference or help would be highly useful.
You did not state what operating system/platform you are using, but the problem seems like a good fit for make.
You specify dependencies between files in your Makefile, along with rules on how to generate one file from the others.
Example:
# file-1 depends on input-file, and is generated via f1-from-input.py
# (note: the recipe lines below must be indented with a literal tab)
file-1: input-file
	f1-from-input.py --input input-file --output file-1

# file-2 depends on file-1, and is generated via f2-from-f1.py
file-2: file-1
	f2-from-f1.py < file-1 > file-2

# And so on
With that Makefile in place, running make file-2 rebuilds only the steps whose inputs have changed. For documentation, check out the GNU Make Manual, or one of the million tutorials on the internet.
I found this link that shows how to call a Python script from Talend and use its output (I'm not sure whether it waits for the script to finish).
The main concept is to run the Python script from Talend Studio by using the tSystem component.
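If the chain has more than a couple of steps, one option is to point tSystem at a single driver script that runs the dependent scripts in order. A minimal sketch, where the script names are placeholders for your own and each step is assumed to pass its output to the next through files on disk:
# run_chain.py - hypothetical driver that a Talend tSystem component (or cron,
# or a plain shell) can call. It runs the dependent scripts in order and stops
# as soon as one of them fails.
import subprocess
import sys

# Placeholder script names: replace with your own, ordered so that each step's
# output exists before the next step needs it.
STEPS = [
    ["python", "third_script.py"],
    ["python", "second_script.py"],
    ["python", "first_script.py"],
]

for step in STEPS:
    code = subprocess.call(step)
    if code != 0:
        sys.exit("Step failed with exit code {}: {}".format(code, " ".join(step)))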
I'm working on a side project where we want to process images in a Hadoop MapReduce program (for eventual deployment to Amazon's Elastic MapReduce). The input to the process will be a list of all the files, each with a little extra data attached (the lat/long position of the bottom-left corner - these are aerial photos).
The actual processing needs to take place in Python code so we can leverage the Python Image Library. All the Python streaming examples I can find use stdin and process text input. Can I send image data to Python through stdin? If so, how?
I wrote a Mapper class in Java that takes the list of files and saves the names, the extra data, and the binary contents to a sequence file. I was thinking maybe I need to write a custom Java mapper that takes in the sequence file and pipes it to Python. Is that the right approach? If so, what should the Java that pipes the images out and the Python that reads them in look like?
In case it's not obvious, I'm not terribly familiar with Java OR Python, so it's also possible I'm just biting off way more than I can chew with this as my introduction to both languages...
There are a few possible approaches that I can see:
Use both the extra data and the file contents as input to your Python program. The tricky part here will be the encoding. I frankly have no idea how streaming works with raw binary content, and I'm assuming the basic answer is "not well." The main issue is that the stdin/stdout communication between processes is very text-based, relying on delimiting input with tabs and newlines, and things like that. You would need to worry about the encoding of the image data, and probably have some sort of pre-processing step or a custom InputFormat so that you could represent the image as text (base64-encoding is one common trick; see the sketch after this list of approaches).
Use only the extra data and the file location as input to your Python program. The program can then read the actual image data from the file independently. The hiccup here is making sure that the file is available to the Python script. Remember this is a distributed environment, so the files would have to be in HDFS or somewhere similar, and I don't know if there are good libraries for reading files from HDFS in Python.
Do the Java-Python interaction yourself. Write a Java mapper that uses the Runtime class to start the Python process itself. This way you get full control over exactly how the two worlds communicate, but obviously it's more code and a bit more involved.
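To make the first approach concrete, here is a rough sketch of the Python side, assuming a separate pre-processing step has already turned each image into a single text line of the form name<TAB>lat,long<TAB>base64-data (that line format is an assumption, not something streaming gives you for free):
#!/usr/bin/env python
# Streaming mapper sketch: each stdin line is assumed to be
#   name <TAB> lat,long <TAB> base64-encoded image bytes
# produced by a separate pre-processing step.
import base64
import io
import sys

from PIL import Image  # Python Imaging Library / Pillow

for line in sys.stdin:
    name, latlong, encoded = line.rstrip("\n").split("\t", 2)
    image = Image.open(io.BytesIO(base64.b64decode(encoded)))
    grey = image.convert("L")  # stand-in for whatever processing you need
    buf = io.BytesIO()
    grey.save(buf, format="JPEG")
    # Re-encode the result as text so it survives the stdout/stdin hop.
    print("{}\t{}\t{}".format(
        name, latlong, base64.b64encode(buf.getvalue()).decode("ascii")))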
I read Hadoop in Action and found that in Java, using the MultipleOutputFormat and MultipleOutputs classes, we can write the reduce output to multiple files, but what I am not sure about is how to achieve the same thing using Python streaming.
For example:
          / out1/part-0000
mapper -> reducer
          \ out2/part-0000
If anyone knows of, has heard of, or has done something similar, please let me know.
The Dumbo Feathers, a set of Java classes to use together with Dumbo (a Python library that makes it easy to write efficient Python M/R programs for Hadoop), does this in its output classes.
Basically, in your Python Dumbo M/R job, you output a key that is a tuple of two elements: the first element is the name of the directory to output to, and the second element is the actual key. The output class you've selected then inspects the tuple to find out which output directory to use, and uses MultipleOutputFormat to write to the different subdirectories.
With Dumbo this is easy thanks to the use of typedbytes as the output format, but I think it should be doable even with other output formats.
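A minimal sketch of the Python side of such a Dumbo job. The (directory, key) tuple convention is the part described above; the Feathers output class that actually splits the tuple and routes records to out1/ and out2/ has to be configured on the Java side and is not shown here.
# Dumbo job sketch: the reducer emits (output_directory, real_key) tuple keys,
# which a Feathers output class (configured separately) is assumed to split
# and route to the named subdirectories.
def mapper(key, value):
    # Hypothetical routing rule: even-length values go to out1, the rest to out2.
    target = "out1" if len(value) % 2 == 0 else "out2"
    yield (target, key), value

def reducer(key, values):
    for value in values:
        yield key, value  # key is still the (directory, real_key) tuple

if __name__ == "__main__":
    import dumbo
    dumbo.run(mapper, reducer)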
I am able to find the name of the input file in a mapper class using FileSplit when writing the program in Java.
Is there a corresponding way to do this when I write a program in Python (using streaming)?
I found the following in the Hadoop Streaming documentation on the Apache site:
See Configured Parameters. During the execution of a streaming job, the names of the "mapred" parameters are transformed. The dots ( . ) become underscores ( _ ). For example, mapred.job.id becomes mapred_job_id and mapred.jar becomes mapred_jar. In your code, use the parameter names with the underscores.
But I still can't understand how to make use of this inside my mapper.
Any help is highly appreciated.
Thanks
According to "Hadoop: The Definitive Guide":
Hadoop sets job configuration parameters as environment variables for Streaming programs. However, it replaces non-alphanumeric characters with underscores to make sure they are valid names. The following Python expression illustrates how you can retrieve the value of the mapred.job.id property from within a Python Streaming script:
os.environ["mapred_job_id"]
You can also set environment variables for the Streaming process launched by MapReduce by applying the -cmdenv option to the Streaming launcher program (once for each variable you wish to set). For example, the following sets the MAGIC_PARAMETER environment variable:
-cmdenv MAGIC_PARAMETER=abracadabra
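On the Python side that variable can then be read from the environment like any other:
import os

magic = os.environ.get("MAGIC_PARAMETER")  # "abracadabra" in the example above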
By reading the mapreduce_map_input_file (new) or map_input_file (deprecated) environment variable, you can get the map input file name.
Notice:
The two environment variables are case-sensitive; all letters are lower-case.
For Hadoop 2.x the underlying property is mapreduce.map.input.file, which shows up in the streaming environment as mapreduce_map_input_file.
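For example, here is a minimal sketch of a streaming mapper that tags every output record with the name of the file it came from; it falls back from the newer variable name to the deprecated one so it works on both older and newer Hadoop versions:
#!/usr/bin/env python
# Streaming mapper sketch: prefix each input line with the map input file name
# taken from the environment variables discussed above.
import os
import sys

input_file = os.environ.get("mapreduce_map_input_file",
                            os.environ.get("map_input_file", "unknown"))

for line in sys.stdin:
    # Emit: filename <TAB> original line (adapt the value to whatever you need).
    print("{}\t{}".format(input_file, line.rstrip("\n")))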