Setting the number of reducers does not work - python

I am using Hadoop streaming with -io typedbytes and set mapred.reduce.tasks=2, but I end up with only one output file. If I set mapred.reduce.tasks=0 instead, I get many output files. I am very confused.
So my question is:
How can I make the mapred.reduce.tasks=num (num > 1) setting take effect when using -io typedbytes in streaming?
PS: my mapper's output is (key: a Python string, value: a NumPy array).
And here is my .sh file:
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.2.1.jar \
-D mapred.reduce.tasks=2 \
-fs local \
-jt local \
-io typedbytes \
-inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat \
-input FFT_SequenceFile \
-output pinvoutput \
-mapper 'pinvmap.py' \
-file pinvmap.py

-D mapred.reduce.tasks=2 \ -fs local \ -jt local
Checking the values of -fs and -jt, I can see you are running the job in local mode.
In local mode, at most one reducer can run (zero or one).
This mode uses the local file system and a single JVM, so no Hadoop daemons are involved.
In pseudo-distributed mode, where all the daemons run on the same machine, the property -D mapred.reduce.tasks=n will take effect and result in n reducers.
So you should use pseudo-distributed mode to work with multiple reducers, for example as sketched below.
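A minimal sketch of the same job resubmitted against a (pseudo-)distributed cluster, simply dropping the local-mode flags; everything else is taken verbatim from the question, and the input path must then exist in HDFS:
# -fs local and -jt local removed so the job runs against the cluster
hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.2.1.jar \
-D mapred.reduce.tasks=2 \
-io typedbytes \
-inputformat org.apache.hadoop.mapred.SequenceFileAsBinaryInputFormat \
-input FFT_SequenceFile \
-output pinvoutput \
-mapper 'pinvmap.py' \
-file pinvmap.py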
Hope it helps!

Related

Apache Beam wordcount pipeline produces no output using Docker Container

I am able to successfully execute the command in the Apache Beam Python SDK Quickstart
tutorial. Specifically, the command
python -m apache_beam.examples.wordcount --input data.txt --output /tmp/out.beam
creates a file /tmp/out.beam-00000-of-00001 containing correct word-counts for data.txt. However, when I try to execute the same pipeline using a Docker container, per the Custom Containers tutorial, the command appears to produce no output.
Specifically, I run
python -m apache_beam.examples.wordcount \
--input=data.txt \
--output=/tmp/out.beam \
--runner=PortableRunner \
--job_endpoint=embed \
--environment_type="DOCKER" \
--environment_config="apache/beam_python3.9_sdk:latest"
But no file matching /tmp/out.beam* is produced. I have scanned through the output and see no errors. Here is a gist with the output.
I should add that this works when I use DirectRunner:
python -m apache_beam.examples.wordcount \
--input=data.txt \
--output=/tmp/out.beam \
--runner=DirectRunner \
--job_endpoint=embed \
--environment_type="DOCKER" \
--environment_config="apache/beam_python3.9_sdk:latest"
But my impression is that DirectRunner is not performant.
Thank you for your help!

Passing folder as argument to a Docker container with the help of volumes

I have a Python script that takes two arguments, -input and -output, which are both directory paths. I would first like to know whether this is a recommended use case of Docker, and also how to run the Docker container while specifying custom input and output folders with the help of Docker volumes.
My post is similar to this: Passing file as argument to Docker container.
Still, I was not able to solve the problem.
It's common practice to use volumes to persist data or to mount some input data. See the postgres image for example.
docker run -d \
--name some-postgres \
-e PGDATA=/var/lib/postgresql/data/pgdata \
-v /custom/mount:/var/lib/postgresql/data \
postgres
You can see how the path to the data dir is set via env var and then a volume is mounted at this path. So the produced data will end up in the volume.
You can also see in the docs that there is a directory /docker-entrypoint-initdb.d/ where you can mount input scripts that run on first DB setup.
In your case it might look something like this.
docker run \
-v "$PWD/input:/app/input" \
-v "$PWD/output:/app/output" \
myapp --input /app/input --output /app/output
Or you use the same volume for both.
docker run \
-v "$PWD/data:/app/data" \
myapp --input /app/data/input --output /app/data/output
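For completeness, a minimal sketch (assuming Python 3) of what the containerized script's argument handling could look like; the --input/--output names and the /app/... paths follow the docker run examples above, and the per-file copy is only a placeholder for your real processing:
import argparse
import os
import shutil

parser = argparse.ArgumentParser()
parser.add_argument('--input', required=True, help='directory mounted into the container, e.g. /app/input')
parser.add_argument('--output', required=True, help='directory backed by a host volume, e.g. /app/output')
args = parser.parse_args()

os.makedirs(args.output, exist_ok=True)
for name in os.listdir(args.input):
    # placeholder processing step: copy each input file into the output volume
    shutil.copy(os.path.join(args.input, name), os.path.join(args.output, name))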

Processing multiple files in HDFS via Python

I have a directory in HDFS that contains roughly 10,000 .xml files. I have a python script "processxml.py" that takes a file and does some processing on it. Is it possible to run the script on all of the files in the hdfs directory, or do I need to copy them to local first in order to do so?
For example, when I run the script on files in a local directory I have:
cd /path/to/files
for file in *.xml
do
python /path/processxml.py $file > /path2/$file
done
So basically, how would I go about doing the same, but this time the files are in hdfs?
You basically have two options:
1) Use the Hadoop streaming jar to create a MapReduce job (here you will only need the map part). Use this command from the shell or inside a shell script (a sketch of a minimal mapper script follows the command):
hadoop jar <the location of the streamlib> \
-D mapred.job.name=<name for the job> \
-input /hdfs/input/dir \
-output /hdfs/output/dir \
-file your_script.py \
-mapper 'python your_script.py' \
-numReduceTasks 0
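For reference, a minimal sketch of what your_script.py could look like as a streaming mapper, assuming each input record arrives as one line on stdin; the real XML handling from processxml.py is not shown in the question, so the processing step here is just a placeholder, and whole-file XML parsing would additionally need a suitable input format:
#!/usr/bin/env python
import sys

for line in sys.stdin:
    record = line.rstrip('\n')
    # placeholder processing step; replace with the real logic from processxml.py
    sys.stdout.write(record.upper() + '\n')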
2) Create a PIG script and ship your python code. Here is a basic example for the script:
input_data = LOAD '/hdfs/input/dir';
DEFINE mycommand `python your_script.py` ship('/path/to/your/script.py');
updated_data = STREAM input_data THROUGH mycommand PARALLEL 20;
STORE updated_data INTO '/hdfs/output/dir';
If you need to process data in your files or move/cp/rm/etc. them around the file system, then PySpark (Spark with a Python interface) would be one of the best options (for speed and memory).
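As a rough illustration of that last option, a PySpark sketch that applies a Python function to every XML file in an HDFS directory; the paths are placeholders and the processing function is an assumption standing in for processxml.py:
from pyspark import SparkContext

sc = SparkContext(appName="processxml")

# wholeTextFiles yields (path, content) pairs, one per file,
# which suits many small XML documents better than line-based input
files = sc.wholeTextFiles("hdfs:///path/to/xml/dir")

def process(path_and_content):
    path, content = path_and_content
    # placeholder processing; replace with the logic from processxml.py
    return path, len(content)

files.map(process).saveAsTextFile("hdfs:///path/to/output/dir")
sc.stop()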

How to load additional JARs for an Hadoop Streaming job on Amazon EMR

TL;DR
How can I upload or specify additional JARs for a Hadoop Streaming job on Amazon Elastic MapReduce (Amazon EMR)?
Long version
I want to analyze a set of Avro files (> 2000 files) using Hadoop on Amazon Elastic MapReduce (Amazon EMR). It should be a simple exercise through which I should gain some confidence with MapReduce and Amazon EMR (I am new to both).
Since Python is my favorite language I decided to use Hadoop Streaming. I built a simple mapper and reducer in Python and tested them on a local Hadoop install (single node). The command I was issuing on my local install was this:
$HADOOP_PREFIX/bin/hadoop jar $HADOOP_PREFIX/share/hadoop/tools/lib/hadoop-streaming-2.4.0-amzn-1.jar \
-files avro-1.7.7.jar,avro-mapred-1.7.7.jar \
-libjars avro-1.7.7.jar,avro-mapred-1.7.7.jar \
-input "input" \
-mapper "python2.7 $PWD/mapper.py" \
-reducer "python2.7 $PWD/reducer.py" \
-output "output/outdir" \
-inputformat org.apache.avro.mapred.AvroAsTextInputFormat
and the job completed successfully.
I have a bucket on Amazon S3 with a folder containing all the input files and another folder with the mapper and reducer scripts (mapper.py and reducer.py respectively).
Using the web interface I created a small cluster, added a bootstrap action to install all the required Python modules on each node, and then added a "Hadoop Streaming" step specifying the location of the mapper and reducer scripts on S3.
The problem is that I don't have the slightest idea how to upload, or specify in the options, the two JARs - avro-1.7.7.jar and avro-mapred-1.7.7.jar - required to run this job.
I have tried several things:
using the -files flag in combination with -libjars in the optional arguments;
adding another bootstrap action that downloads the JARs on every node (and I have tried downloading them to different locations on the nodes);
uploading the JARs to my bucket and specifying a full s3://... path as the argument to -libjars in the options (note: these files are ignored by Hadoop, and a warning is issued).
If I don't pass the two JARs the job fails (it does not recognize the -inputformat class), but I have tried all the possibilities (and combinations thereof!) I could think of, to no avail.
In the end, I figured it out (and it was, of course, obvious).
Here's how I did it:
Add a bootstrap action that downloads the JARs on every node. For example, you can upload the JARs to your bucket, make them public and then do:
wget https://yourbucket/path/somejar.jar -O $HOME/somejar.jar
wget https://yourbucket/path/avro-1.7.7.jar -O $HOME/avro-1.7.7.jar
wget https://yourbucket/path/avro-mapred-1.7.7.jar -O $HOME/avro-mapred-1.7.7.jar
When you specify -libjars in the optional arguments, use the absolute path, so:
-libjars /home/hadoop/somejar.jar,/home/hadoop/avro-1.7.7.jar,/home/hadoop/avro-mapred-1.7.7.jar
I have lost a number of hours on this that I am ashamed to admit; I hope this helps somebody else.
Edit (Feb 10th, 2015)
I have double-checked, and I want to point out that environment variables do not seem to be expanded when passed in the optional arguments field. So, spell out the path explicitly (i.e. /home/hadoop instead of $HOME).
Edit (Feb 11th, 2015)
If you want to launch a streaming job on Amazon EMR using the AWS CLI, you can use the following command:
aws emr create-cluster --ami-version '3.3.2' \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType='m1.medium' InstanceGroupType=CORE,InstanceCount=2,InstanceType='m3.xlarge' \
--name 'TestStreamingJob' \
--no-auto-terminate \
--log-uri 's3://path/to/your/bucket/logs/' \
--no-termination-protected \
--enable-debugging \
--bootstrap-actions Path='s3://path/to/your/bucket/script.sh',Name='ExampleBootstrapScript' Path='s3://path/to/your/bucket/another_script.sh',Name='AnotherExample' \
--steps file://./steps_test.json
and you can specify the steps in a JSON file:
[
  {
    "Name": "Avro",
    "Args": [
      "-files", "s3://path/to/your/mapper.py,s3://path/to/your/reducer.py",
      "-libjars", "/home/hadoop/avro-1.7.7.jar,/home/hadoop/avro-mapred-1.7.7.jar",
      "-inputformat", "org.apache.avro.mapred.AvroAsTextInputFormat",
      "-mapper", "mapper.py",
      "-reducer", "reducer.py",
      "-input", "s3://path/to/your/input_directory/",
      "-output", "s3://path/to/your/output_directory/"
    ],
    "ActionOnFailure": "CONTINUE",
    "Type": "STREAMING"
  }
]
(Please note that the official Amazon documentation is somewhat outdated; in fact it uses the old Amazon EMR CLI tool, which has been deprecated in favor of the more recent AWS CLI.)

Hadoop Streaming Job failed error in python

From this guide, I have successfully run the sample exercise. But when I run my MapReduce job, I get the following error:
ERROR streaming.StreamJob: Job not Successful!
10/12/16 17:13:38 INFO streaming.StreamJob: killJob...
Streaming Job Failed!
Error from the log file
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Mapper.py
import sys
i=0
for line in sys.stdin:
    i+=1
    count={}
    for word in line.strip().split():
        count[word]=count.get(word,0)+1
    for word,weight in count.items():
        print '%s\t%s:%s' % (word,str(i),str(weight))
Reducer.py
import sys
keymap={}
o_tweet="2323"
id_list=[]
for line in sys.stdin:
    tweet,tw=line.strip().split()
    #print tweet,o_tweet,tweet_id,id_list
    tweet_id,w=tw.split(':')
    w=int(w)
    if tweet.__eq__(o_tweet):
        for i,wt in id_list:
            print '%s:%s\t%s' % (tweet_id,i,str(w+wt))
        id_list.append((tweet_id,w))
    else:
        id_list=[(tweet_id,w)]
        o_tweet=tweet
[edit] command to run the job:
hadoop#ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -file /home/hadoop/mapper.py -mapper /home/hadoop/mapper.py -file /home/hadoop/reducer.py -reducer /home/hadoop/reducer.py -input my-input/* -output my-output
Input is any random sequence of sentences.
Thanks,
Your -mapper and -reducer should just be the script name.
hadoop#ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -file /home/hadoop/mapper.py -mapper mapper.py -file /home/hadoop/reducer.py -reducer reducer.py -input my-input/* -output my-output
Your scripts are shipped with the job into a folder within HDFS that the task attempt sees as its working directory ".". (FYI, if you ever want to add another -file, such as a lookup table, you can open it in Python as if it were in the same directory as your scripts while your script runs in the M/R job.)
Also make sure you have run chmod a+x mapper.py and chmod a+x reducer.py.
Try adding
#!/usr/bin/env python
at the top of your script.
Or,
-mapper 'python m.py' -reducer 'python r.py'
You need to explicitly indicate that the mapper and reducer are to be run as Python scripts, since there are several options for streaming. You can use either single quotes or double quotes.
-mapper "python mapper.py" -reducer "python reducer.py"
or
-mapper 'python mapper.py' -reducer 'python reducer.py'
The full command goes like this:
hadoop jar /path/to/hadoop-mapreduce/hadoop-streaming.jar \
-input /path/to/input \
-output /path/to/output \
-mapper 'python mapper.py' \
-reducer 'python reducer.py' \
-file /path/to/mapper-script/mapper.py \
-file /path/to/reducer-script/reducer.py
I ran into this error recently, and my problem turned out to be something as obvious (in hindsight) as these other solutions:
I simply had a bug in my Python code. (In my case, I was using Python v2.7 string formatting whereas the AWS EMR cluster I had was using Python v2.6).
To find the actual Python error, go to the Job Tracker web UI (in the case of AWS EMR, port 9100 for AMI 2.x and port 9026 for AMI 3.x), find the failed mapper, open its logs, and read the stderr output.
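Another quick way to surface such Python errors is to run the scripts through a plain local pipe before submitting the job; sample_input.txt here is a placeholder for a few lines of your real input:
# simulate the map -> shuffle/sort -> reduce flow locally
cat sample_input.txt | python mapper.py | sort | python reducer.py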
Make sure your input directory only contains the correct files.
I too had the same problem.
I tried the solution of marvin W, and I also installed Spark. Make sure that you have installed Spark itself, not just pyspark (the dependency), by following a framework installation tutorial.
If you run this command on a Hadoop cluster, make sure that Python is installed on every NodeManager instance.