Failed to include a third-party Python package with a Hadoop streaming job - python

I would like to include a third-party Python library when running a Hadoop streaming job.
I followed the suggestions in the post here, but it doesn't seem to work.
I submitted a command like this:
hadoop jar /usr/local/hadoop/hadoop-2.2.0/lib/hadoop-streaming-2.2.0.jar \
-input $hdfs_input_file \
-output $hdfs_output_file \
-mapper $mapper_file \
-combiner $reducer_file \
-reducer $reducer_file \
-file $mapper_file \
-file $reducer_file \
-file $packaged_file
The $packaged_file is an archive containing the third-party library.
My script failed at this line (in $mapper_file):
xyz = importer.load_module('library_name')
The error message is
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:320)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:533)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
However, the above line of code runs fine in IPython. I can even run the following line in IPython:
xyz.method_foo()
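For context, importer is created with zipimport over the shipped archive, along the lines of the post I followed (the archive name here is a placeholder):
import zipimport
# 'library_name.zip' stands in for the archive shipped via -file
importer = zipimport.zipimporter('library_name.zip')
xyz = importer.load_module('library_name')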
Any suggestions on this problem? Thanks!

Related

hadoop, python, subprocess failed with code 127

I'm trying to run a very simple task with MapReduce.
mapper.py:
#!/usr/bin/env python
import sys
for line in sys.stdin:
    print line
my txt file:
qwerty
asdfgh
zxc
Command line to run the job:
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.6.0-mr1-cdh5.8.0.jar \
-input /user/cloudera/In/test.txt \
-output /user/cloudera/test \
-mapper /home/cloudera/Documents/map.py \
-file /home/cloudera/Documents/map.py
Error:
INFO mapreduce.Job: Task Id : attempt_1490617885665_0008_m_000001_0, Status : FAILED
Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 127
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:325)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:538)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:130)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:61)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:34)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:453)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1693)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
How do I fix this and run the code?
When I use cat /home/cloudera/Documents/test.txt | python /home/cloudera/Documents/map.py it works fine.
UPDATE:
Something was wrong with my *.py file. I copied the file from the 'tom white hadoop book' GitHub repo and now everything is working fine.
But I can't understand the reason. It is not the permissions or the charset (if I am not wrong). What else could it be?
I faced the same problem.
Issue:
When the Python file is created in a Windows environment, the newline character is CRLF.
My Hadoop runs on Linux, which expects LF as the newline character.
Solution:
After changing CRLF to LF, the step ran successfully.
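A quick way to check for and fix this from Python itself (map.py is a placeholder file name):
# Read the script in binary mode and replace Windows CRLF endings with LF.
with open('map.py', 'rb') as f:
    data = f.read()
if b'\r\n' in data:
    with open('map.py', 'wb') as f:
        f.write(data.replace(b'\r\n', b'\n'))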
The -mapper argument should be the command that runs on the cluster nodes, and there is no /home/cloudera/Documents/map.py file there.
Files that you pass with the -files option are placed in the task's working directory, so you can simply reference the script as ./map.py.
I don't remember what permissions are set on this file, so if it lacks execute permission, invoke it as python map.py.
So the full command is:
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.6.0-mr1-cdh5.8.0.jar \
-input /user/cloudera/In/test.txt \
-output /user/cloudera/test \
-mapper "python map.py" \
-file /home/cloudera/Documents/map.py
You have an error in your mapper.py or reducer.py. For example:
Not using #!/usr/bin/env python at the top of the files.
A syntax or logic error in your Python code (for example, print is a statement in Python 2 but a function in Python 3; see the sketch below).
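As a minimal sketch, a mapper that runs unchanged under both interpreters:
#!/usr/bin/env python
# print_function makes print() behave the same on Python 2.6+ and Python 3.
from __future__ import print_function
import sys

for line in sys.stdin:
    print(line.strip())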
First, check python --version. If the output of python --version is
Command 'python' not found, but can be installed with:
sudo apt install python3
sudo apt install python
sudo apt install python-minimal
You also have python3 installed, you can run 'python3' instead.
then install Python with sudo apt install python and run your Hadoop job again.
This worked on my PC, and the job finally runs.
On a local Hadoop 3.2.1 install on macOS, I solved my java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 127 issue here: https://stackoverflow.com/a/61624913/4201275
Let's assume this is your streaming job as it looks on Windows. The .py files have CRLF newline characters, so you need to either clean up CRLF to LF manually or use these sed commands, and you should be good:
!sed -i -e 's/\r$//' WordCount/reducer.py
!sed -i -e 's/\r$//' WordCount/mapper.py
I used the ! here to tell the Python notebook that I am executing the commands in a shell (I run the notebook in a VM on Windows):
!hadoop jar {JAR_FILE} \
-files WordCount/reducer.py,WordCount/mapper.py \
-mapper mapper.py \
-reducer reducer.py \
-input {HDFS_DIR}/alice.txt \
-output {HDFS_DIR}/wordcount-output \
-cmdenv PATH={PATH}

Map and reduce fail when run using the hadoop streaming command

My Python mapper and reducer code runs fine when I run it without the Hadoop streaming command:
hadoop fs -cat /user/root/myinput/testfile3_node.csv | ./mapper_1.py | sort | ./reducer_1.py
whereas when I run the code using the Hadoop streaming command, it fails:
hadoop jar /usr/iop/current/hadoop-mapreduce-client/hadoop-streaming.jar -mapper ./mapper_1.py -reducer ./reducer_1.py -file ./mapper_1.py -file ./reducer_1.py -input /user/root/myinput/testfile3.csv -output /user/root/myoutput/indexing_output1
Outputs:
Screenshot of the simple command running.
Screenshot of the Hadoop streaming jar command.
Try without ./ on the -mapper and -reducer parameters (make sure you are in the right directory), and there is also no need for the -file options:
hadoop jar /usr/iop/current/hadoop-mapreduce-client/hadoop-streaming.jar \
-mapper mapper_1.py \
-reducer reducer_1.py \
-input /user/root/myinput/testfile3.csv -output /user/root/myoutput/indexing_output1
Here are the Apache Hadoop streaming docs:
https://hadoop.apache.org/docs/r1.2.1/streaming.html

Wordcount error using Python

I am running a word count program in Hadoop with the command below:
hadoop jar /usr/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming.jar -file /home/Hadoop/Python/mapper.py -mapper mapper.py -file /home/Hadoop/Python/reducer.py -reducer reducer.py -input "/Hadoop/Hive.txt" -output "/Hadoop/output.txt"
And the error below is causing the program to fail:
Caused by: java.io.IOException: Cannot run program
"/var/lib/hadoop-yarn/cache/yarn/nm-local-dir/usercache/root/appcache/application_1427776730247_0008/container_1427776730247_0008_01_000006/./mapper.py":
error=2, No such file or directory
I think these are runtime directories that get created, and YARN has read/write permission on them.
Do I need to manually change permissions for these directories and place the mapper and reducer files there?

error while executing python mapreduce tasks in hadoop?

I have written a mapper and reducer for the wordcount example in Python. The scripts work fine standalone, but I get an error when they run in Hadoop.
I am using Hadoop 2.2.
Here is my command:
hadoop jar share/hadoop/tools/sources/hadoop-streaming*.jar -mapper wordmapper.py -reducer wordreducer.py -file wordmapper.py -file wordreducer.py -input /data -output/output/result7
Exception in thread "main" java.lang.ClassNotFoundException: share.hadoop.tools.sources.hadoop-streaming-2.2.0-test-sources.jar
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:249)
at org.apache.hadoop.util.RunJar.main(RunJar.java:205)
How do I fix this?
Can you please try it with:
hadoop jar $HADOOP_PREFIX/hadoop/tools/sources/hadoop-streaming*.jar -mapper 'wordmapper.py' -reducer 'wordreducer.py' -file $CODE_FOLDER/wordmapper.py -file $CODE_FOLDER/wordreducer.py -input /data -output /output/result7
where $HADOOP_PREFIX is the folder where Hadoop is installed on your machine (e.g. /usr/local/ on mine) and $CODE_FOLDER is the folder containing the script files.
Manually access that location and check whether that jar is present.

Hadoop Streaming Job failed error in python

From this guide, I have successfully run the sample exercise. But when running my MapReduce job, I get the following error:
ERROR streaming.StreamJob: Job not Successful!
10/12/16 17:13:38 INFO streaming.StreamJob: killJob...
Streaming Job Failed!
Error from the log file
java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 2
at org.apache.hadoop.streaming.PipeMapRed.waitOutputThreads(PipeMapRed.java:311)
at org.apache.hadoop.streaming.PipeMapRed.mapRedFinished(PipeMapRed.java:545)
at org.apache.hadoop.streaming.PipeMapper.close(PipeMapper.java:132)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:57)
at org.apache.hadoop.streaming.PipeMapRunner.run(PipeMapRunner.java:36)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Mapper.py
import sys
i=0
for line in sys.stdin:
    i+=1
    count={}
    for word in line.strip().split():
        count[word]=count.get(word,0)+1
    for word,weight in count.items():
        print '%s\t%s:%s' % (word,str(i),str(weight))
Reducer.py
import sys
keymap={}
o_tweet="2323"
id_list=[]
for line in sys.stdin:
    tweet,tw=line.strip().split()
    #print tweet,o_tweet,tweet_id,id_list
    tweet_id,w=tw.split(':')
    w=int(w)
    if tweet.__eq__(o_tweet):
        for i,wt in id_list:
            print '%s:%s\t%s' % (tweet_id,i,str(w+wt))
        id_list.append((tweet_id,w))
    else:
        id_list=[(tweet_id,w)]
        o_tweet=tweet
[edit] command to run the job:
hadoop#ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -file /home/hadoop/mapper.py -mapper /home/hadoop/mapper.py -file /home/hadoop/reducer.py -reducer /home/hadoop/reducer.py -input my-input/* -output my-output
Input is any random sequence of sentences.
Thanks,
Your -mapper and -reducer should just be the script name.
hadoop#ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-0.20.0-streaming.jar -file /home/hadoop/mapper.py -mapper mapper.py -file /home/hadoop/reducer.py -reducer reducer.py -input my-input/* -output my-output
When your scripts are shipped with the job, they end up in a folder on HDFS that the attempt task sees as "." (FYI, if you ever want to add another -file, such as a lookup table, you can open it in Python as if it were in the same directory as your scripts while your script runs in the M/R job; see the sketch below).
Also make sure you have run chmod a+x mapper.py and chmod a+x reducer.py.
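A minimal sketch of that lookup-table trick (lookup.txt is a hypothetical file shipped with an extra -file lookup.txt option):
#!/usr/bin/env python
import sys

# lookup.txt was shipped with -file, so it sits in the task's working
# directory next to this script. Assumes tab-separated key/value rows.
lookup = {}
with open('lookup.txt') as f:
    for row in f:
        key, value = row.strip().split('\t', 1)
        lookup[key] = value

for line in sys.stdin:
    word = line.strip()
    print('%s\t%s' % (word, lookup.get(word, 'UNKNOWN')))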
Try to add
#!/usr/bin/env python
at the top of your script.
Or,
-mapper 'python m.py' -reducer 'python r.py'
You need to explicitly instruct Hadoop that the mapper and reducer are Python scripts, since streaming accepts several kinds of commands. You can use either single quotes or double quotes.
-mapper "python mapper.py" -reducer "python reducer.py"
or
-mapper 'python mapper.py' -reducer 'python reducer.py'
The full command goes like this:
hadoop jar /path/to/hadoop-mapreduce/hadoop-streaming.jar \
-input /path/to/input \
-output /path/to/output \
-mapper 'python mapper.py' \
-reducer 'python reducer.py' \
-file /path/to/mapper-script/mapper.py \
-file /path/to/reducer-script/reducer.py
I ran into this error recently, and my problem turned out to be something as obvious (in hindsight) as these other solutions:
I simply had a bug in my Python code. (In my case, I was using Python v2.7 string formatting whereas the AWS EMR cluster I had was using Python v2.6).
To find the actual Python error, go to the Job Tracker web UI (in the case of AWS EMR, port 9100 for AMI 2.x and port 9026 for AMI 3.x), find the failed mapper, open its logs, and read the stderr output.
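One concrete example of that kind of version difference (not necessarily the exact bug from this answer): automatic field numbering in str.format() was only added in Python 2.7, so code relying on it crashes on a 2.6 cluster.
word, count = 'foo', 3
# Works on Python 2.7+, but raises ValueError on Python 2.6:
print('{}\t{}'.format(word, count))
# Python 2.6-compatible form with explicit field numbers:
print('{0}\t{1}'.format(word, count))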
Make sure your input directory only contains the correct files.
I had the same problem.
I tried marvin W's solution, and I also installed Spark. Ensure that you have installed Spark itself, not just pyspark (the dependency); follow the framework's installation tutorial.
If you run this command on a Hadoop cluster, make sure that Python is installed on every NodeManager instance.
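As a quick sanity check (all paths here are placeholders), you can run a trivial map-only streaming job whose mapper reports each node's interpreter; nodes without Python fail it with code 127:
hadoop jar /path/to/hadoop-streaming.jar \
-input /some/small/input \
-output /tmp/python-check \
-mapper "python -c 'import sys; sys.stdin.read(); print(sys.executable)'" \
-reducer NONE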