My python mapper and reducer code is running fine when i m running with out hadoop streaming command
hadoop fs -cat /user/root/myinput/testfile3_node.csv | ./mapper_1.py | sort | ./reducer_1.py
where as when i am running the code using hadoop streaming command then it fails
hadoop jar /usr/iop/current/hadoop-mapreduce-client/hadoop-streaming.jar -mapper ./mapper_1.py -reducer ./reducer_1.py -file ./mapper_1.py -file ./reducer_1.py -input /user/root/myinput/testfile3.csv -output /user/root/myoutput/indexing_output1
Outputs:
Screenshot of simple command_running.
Screenshot of Hadoop streaming jar command.
Try without ./ on the -mapper and -reducer parameters(make sure you are in the right directory) and also there is no need fot the -files:
hadoop jar /usr/iop/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-mapper mapper_1.py \
-reducer reducer_1.py \
-input /user/root/myinput/testfile3.csv -output /user/root/myoutput/indexing_output1
Here is the Apache Hadoop docs:
https://hadoop.apache.org/docs/r1.2.1/streaming.html
I have been using a Hadoop MapReduce command on SSH like this:
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.0.1.jar -file /nfs_home/appers/user1/mapper.py -file /nfs_home/appers/user1/reducer.py -mapper '/usr/lib/python_2.7.3/bin/python mapper.py' -reducer '/usr/lib/python_2.7.3/bin/python reducer.py' -input /ccexp/data/test_xml/0901282-510179094535002-oozie-oozi-W/extractOut//.xml -output /user/ccexptest/output/user1/MRoutput
However, I am expanding the capability of the Mapper.py script and would like to pass a string argument to the Mapper when launching the MapReduce job. How can I edit the above MR command so that a string is included as an argument to the Mapper?
You can pass an arbitrary command line input into the -mapper and -reducer, for your case
-mapper '/usr/lib/python_2.7.3/bin/python mapper.py myargs'
I am running word count program in hadoop with below command
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar -file /home/Hadoop/Python/mapper.py -mapper mapper.py -file /home/Hadoop/Python/reducer.py -reducer reducer.py -input "/Hadoop/Hive.txt" -output "/Hadoop/output.txt"
And below error causing the failure of the program
Caused by: java.io.IOException: Cannot run program
"/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1427776730247_0008/container_1427776730247_0008_01_000006/./mapper.py":
error=2, No such file or directory
I think these are runtime directories created and yarn is having read write permission for this directory.
Do I need to manually change permission for these directories and place mapper and reducer file there?
I'm using Hadoop streaming to use my mapper and reducer code in python to run a Mapreduce job. I have input data in s3, and I'm trying to use that for the job. However, when I run the command like this -->
bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -file aish1/mapperi.py
-mapper aish1/mapperi.py -file aish1/reduceri.py -reducer aish1/reduceri.py
-file s3://INLOCATION -input s3://INLOCATION -output s3://OUTLOCATION
I get the error:
File: /home/hadoop/s3:/INLOCATION does not exist, or is not readable.
Streaming Command Failed!
I don't understand why it adds the /home/hadoop/ in front of my s3 INLOCATION. Any help would be greatly appreciated!
Don't use -file preparation for input.
Argument -file should be used when you want to use files from local file system, so Hadoop will upload them to HDFS. In you case input is already in appropriable location.
Change you invokation:
bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -file aish1/mapperi.py
-mapper aish1/mapperi.py -file aish1/reduceri.py -reducer aish1/reduceri.py
-input s3://INLOCATION -output s3://OUTLOCATION
From this guide, I have successfully run the sample exercise. But on running my mapreduce job, I am getting the following error
ERROR streaming.StreamJob: Job not Successful!
10/12/16 17:13:38 INFO streaming.StreamJob: killJob...
Streaming Job Failed!
Error from the log file
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Mapper.py
import sys
i=0
for line in sys.stdin:
i+=1
count={}
for word in line.strip().split():
count[word]=count.get(word,0)+1
for word,weight in count.items():
print '%s\t%s:%s' % (word,str(i),str(weight))
Reducer.py
import sys
keymap={}
o_tweet="2323"
id_list=[]
for line in sys.stdin:
tweet,tw=line.strip().split()
#print tweet,o_tweet,tweet_id,id_list
tweet_id,w=tw.split(':')
w=int(w)
if tweet.__eq__(o_tweet):
for i,wt in id_list:
print '%s:%s\t%s' % (tweet_id,i,str(w+wt))
id_list.append((tweet_id,w))
else:
id_list=[(tweet_id,w)]
o_tweet=tweet
[edit] command to run the job:
hadoop#ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -file /home/hadoop/mapper.py -mapper /home/hadoop/mapper.py -file /home/hadoop/reducer.py -reducer /home/hadoop/reducer.py -input my-input/* -output my-output
Input is any random sequence of sentences.
Thanks,
Your -mapper and -reducer should just be the script name.
hadoop#ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -file /home/hadoop/mapper.py -mapper mapper.py -file /home/hadoop/reducer.py -reducer reducer.py -input my-input/* -output my-output
When your scripts are in the job that is in another folder within hdfs which is relative to the attempt task executing as "." (FYI if you ever want to ad another -file such as a look up table you can open it in Python as if it was in the same dir as your scripts while your script is in M/R job)
also make sure you have chmod a+x mapper.py and chmod a+x reducer.py
Try to add
#!/usr/bin/env python
top of your script.
Or,
-mapper 'python m.py' -reducer 'r.py'
You need to explicitly instruct that mapper and reducer are used as python script, as we have several options for streaming. You can use either single quotes or double quotes.
-mapper "python mapper.py" -reducer "python reducer.py"
or
-mapper 'python mapper.py' -reducer 'python reducer.py'
The full command goes like this:
hadoop jar /path/to/hadoop-mapreduce/hadoop-streaming.jar \
-input /path/to/input \
-output /path/to/output \
-mapper 'python mapper.py' \
-reducer 'python reducer.py' \
-file /path/to/mapper-script/mapper.py \
-file /path/to/reducer-script/reducer.py
I ran into this error recently, and my problem turned out to be something as obvious (in hindsight) as these other solutions:
I simply had a bug in my Python code. (In my case, I was using Python v2.7 string formatting whereas the AWS EMR cluster I had was using Python v2.6).
To find the actual Python error, go to Job Tracker web UI (in the case of AWS EMR, port 9100 for AMI 2.x and port 9026 for AMI 3.x); find the failed mapper; open its logs; and read the stderr output.
make sure your input directory only contains the correct files
I too had the same problem
i tried solution of marvin W
and i also install spark , ensure that u have installed spark , not just pyspark(dependency) but also install the framework installtion tutorial
follow that tutorial
if you run this command in a hadoop cluster, make sure that python is installed in every NodeMnager instance.
#hadoop