PySpark - Input path does not exist on YARN. Works fine locally - python

I am new to Spark and Python and I am trying to launch a Python script (through a bash run.sh command).
When I run it in local mode, everything is fine. When I try to run it on the cluster (which has Spark 2.1.2 without Hadoop), I receive the error below.
I hope this info is enough.
What should I do so that the script runs on YARN?
from pyspark import SparkContext, SparkConf
import sys
import collections
import os
from subprocess import call, Popen
import numpy
import re
import requests
import json
import math
from bs4 import BeautifulSoup
from bs4.element import Comment
sc = SparkContext("yarn", "test")
record_attribute = sys.argv[1]
in_file = sys.argv[2]
#Read warc file and split in WARC/1.0
rdd = sc.newAPIHadoopFile(in_file,
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "WARC/1.0"})
And this is the error
17/11/30 14:05:48 INFO spark.SparkContext: Created broadcast 1 from broadcast at PythonRDD.scala:553
Traceback (most recent call last):
File "/home/test/script.py", line 51, in <module>
,conf={"textinputformat.record.delimiter": "WARC/1.0"})
File "/home/test/spark-2.1.2-bin-without-hadoop/python/lib/pyspark.zip/pyspark/context.py", line 651, in newAPIHadoopFile
File "/home/test/spark-2.1.2-bin-without-hadoop/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/home/test/spark-2.1.2-bin-without-hadoop/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile.
: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist

"When I run it in local mode, everything is fine"
That is because in local mode you are loading the file from the local filesystem. When you deploy to the cluster, you need to ensure that in_file exists on every node that can run an executor (the YARN NodeManagers), or on a shared filesystem such as HDFS or S3.
You should specify where your data is located in that bash script, e.g. hdfs:///some/input/in_file.
If you did not copy your Hadoop cluster's core-site.xml and hdfs-site.xml into a local directory pointed to by the HADOOP_CONF_DIR environment variable, or otherwise configure the filesystem yourself, then the default behavior is to read the local filesystem, and you will therefore need to use the full external filesystem URI. For HDFS the format is hdfs://namenode:port/some/input/in_file.
Note: you need to upload your file to the remote filesystem first.
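For example, a minimal sketch of the same read with an explicit HDFS URI. The namenode host, the port, and the /user/test path are assumptions; substitute your cluster's fs.defaultFS and the location you actually uploaded the file to.
from pyspark import SparkContext

sc = SparkContext("yarn", "test")

# Assumed location: uploaded beforehand, e.g. with `hdfs dfs -put input.warc /user/test/input.warc`
in_file = "hdfs://namenode:8020/user/test/input.warc"

rdd = sc.newAPIHadoopFile(in_file,
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "WARC/1.0"})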

Did you configure HADOOP_CONF_DIR or YARN_CONF_DIR properly? That directory should contain the client configs of the HDFS and YARN services, so that the Spark application can get resources from YARN and can perform read/write operations against HDFS.
Kindly check the doc below, which describes the prerequisites for running Spark on YARN:
https://spark.apache.org/docs/2.1.1/running-on-yarn.html
If you deploy Spark from Cloudera Manager or an Ambari server, all environment variables associated with the client configuration are deployed automatically.
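A quick way to sanity-check this from the driver before creating the SparkContext is sketched below; the file names are the standard Hadoop client configs, and the check itself is only an illustration, not part of the answer above.
import os

conf_dir = os.environ.get("HADOOP_CONF_DIR") or os.environ.get("YARN_CONF_DIR") or ""
print("conf dir: " + (conf_dir or "<not set>"))
for name in ("core-site.xml", "hdfs-site.xml", "yarn-site.xml"):
    path = os.path.join(conf_dir, name)
    print(name + ": " + ("found" if os.path.exists(path) else "MISSING") + " (" + path + ")")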

Running Python SparkPi in YARN Cluster Mode
Run the pi.py file:
spark-submit --master yarn --deploy-mode cluster SPARK_HOME/lib/pi.py 10
Please refer to the following link for more information: "Running Spark Applications on YARN".

Related

I am getting error while defining H2OContext in python spark script

Code:
from pyspark.sql import SparkSession
from pysparkling import *
hc = H2OContext.getOrCreate()
I am using a Spark standalone cluster (3.2.1) and trying to initiate an H2OContext in a Python file. While trying to run the script using spark-submit, I am getting the following error:
hc = H2OContext.getOrCreate() NameError: name 'H2OContext' is not defined
Spark-submit command:
spark-submit --master spark://local:7077 --packages
ai.h2o:sparkling-water-package_2.12:3.36.1.3-1-3.2 spark_h20/h2o.py
The parameter --packages ai.h2o:sparkling-water-package_2.12:3.36.1.3-1-3.2 downloads a jar artifact from Maven. This artifact can be used only from Scala/Java. I see there is a mistake in the Sparkling Water documentation.
If you want to use the Python API, you need to:
Download SW zip archive from this location
Unzip the archive and go to the unzipped folder
Use the command spark-submit --master spark://local:7077 --py-files py/h2o_pysparkling_3.2-3.36.1.3-1-3.2.zip spark_h20/h2o.py for submitting the script to the cluster.
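Once the zip is on --py-files, the script itself can import the Python API; a minimal sketch (the explicit import is just one style; the original from pysparkling import * also works once the package is found):
from pyspark.sql import SparkSession
from pysparkling import H2OContext

# Reuse (or create) the session started by spark-submit, then attach H2O to it.
spark = SparkSession.builder.appName("h2o-test").getOrCreate()
hc = H2OContext.getOrCreate()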

Apache Spark with Python: error

New to Spark. Downloaded everything alright but when I run pyspark I get the following errors:
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
17/02/05 20:46:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
File "C:\Users\Carolina\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\bin\..\python\pyspark\shell.py", line 43, in <module>
spark = SparkSession.builder\
File "C:\Users\Carolina\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\python\pyspark\sql\session.py", line 179, in getOrCreate
session._jsparkSession.sessionState().conf().setConfString(key, value)
File "C:\Users\Carolina\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
File "C:\Users\Carolina\spark-2.1.0-bin-hadoop2.7\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\spark-2.1.0-bin-hadoop2.6\python\pyspark\sql\utils.py", line 79, in deco
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionState':"
Also, when I try (as recommended by http://spark.apache.org/docs/latest/quick-start.html)
textFile = sc.textFile("README.md")
I get:
NameError: name 'sc' is not defined
Any advice? Thank you!
If you are doing it from the pyspark console, it may be because your installation did not work.
If not, it's because most examples assume you are testing code in the pyspark console, where a default variable sc exists.
You can create a SparkContext by yourself at the beginning of your script using the following code:
from pyspark import SparkContext, SparkConf
conf = SparkConf()
sc = SparkContext(conf=conf)
It looks like you've found the answer to the second part of your question in the answer above, but for future users getting here via the 'org.apache.spark.sql.hive.HiveSessionState' error: this class is found in the spark-hive jar file, which does not come bundled with Spark unless Spark is built with Hive.
You can get this jar at:
http://central.maven.org/maven2/org/apache/spark/spark-hive_${SCALA_VERSION}/${SPARK_VERSION}/spark-hive_${SCALA_VERSION}-${SPARK_VERSION}.jar
You'll have to put it into your SPARK_HOME/jars folder, and then Spark should be able to find all of the Hive classes required.
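If you don't actually need Hive support in your own scripts, another option (a workaround, not part of the answer above) is to build the session without enableHiveSupport(); a minimal sketch:
from pyspark.sql import SparkSession

# A plain session: does not require the spark-hive jar or a Hive metastore.
spark = SparkSession.builder.appName("no-hive").getOrCreate()
spark.range(5).show()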
I also encountered this issue on Windows 7 with pre-built Spark 2.2. Here is a possible solution for Windows users:
make sure you get all the environment variables set correctly, including SPARK_PATH, HADOOP_HOME, etc.
get the correct version of winutils.exe for the Spark-Hadoop prebuilt package
then open a cmd prompt as Administrator and run this command:
winutils chmod 777 C:\tmp\hive
Note: The drive might be different depending on where you invoke pyspark or spark-shell
This link should take the credit: see the answer by timesking
If you're on a Mac and you've installed Spark (and possibly Hive) through Homebrew, the answers from Eric Pettijohn and user7772046 will not work: the former because Homebrew's Spark already contains the aforementioned jar file, and the latter because it is a purely Windows-based solution.
Inspired by this link and the permission-issues hint, I came up with the following simple solution: launch pyspark using sudo. No more Hive-related errors.
I deleted the metastore_db directory and then things worked. I'm doing some light development on a MacBook; I had run PyCharm to sync my directory with the server, and I think it picked up that Spark-specific directory and messed it up. For me, the error message came when I was trying to start an interactive IPython pyspark shell.
In my case, I had configured Hadoop for YARN mode, so my solution was to start HDFS and YARN:
start-dfs.sh
start-yarn.sh
I came across the error:
raise IllegalArgumentException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.IllegalArgumentException: u"Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder'
This was because I had already run ./bin/spark-shell.
So just kill that spark-shell and re-run ./bin/pyspark.
You need a "winutils" competable in the hadoop bin directory.

import pymongo_spark doesn't work when executing with spark-commit

I am running into a problem running my script with spark-submit. The main script won't even run, because import pymongo_spark returns ImportError: No module named pymongo_spark.
I checked this thread and this thread to try to figure out the issue, but so far there's no result.
My setup:
$HADOOP_HOME is set to /usr/local/cellar/hadoop/2.7.1 where my hadoop files are
$SPARK_HOME is set to /usr/local/cellar/apache_spark/1.5.2
I also followed those threads and guides online as closely as possible to get:
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
export PATH=$PATH:$HADOOP_HOME/bin
PYTHONPATH=$SPARK_HOME/libexec/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
then I used this piece of code to test in the first thread I linked
from pyspark import SparkContext, SparkConf
import pymongo_spark
pymongo_spark.activate()
def main():
    conf = SparkConf().setAppName('pyspark test')
    sc = SparkContext(conf=conf)

if __name__ == '__main__':
    main()
Then in the terminal, I did:
$SPARK_HOME/bin/spark-submit --jars $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-r1.4.2-1.4.2.jar --driver-class-path $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-r1.4.2-1.4.2.jar --master local[4] ~/Documents/pysparktest.py
Where mongo-hadoop-r1.4.2-1.4.2.jar is the jar I built following this guide
I'm definitely missing things, but I'm not sure where/what I'm missing. I'm running everything locally on Mac OSX El Capitan. Almost sure this doesn't matter, but just wanna add it in.
EDIT:
I also used another jar file mongo-hadoop-1.5.0-SNAPSHOT.jar, the same problem remains
my command:
$SPARK_HOME/bin/spark-submit --jars $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-1.5.0-SNAPSHOT.jar --driver-class-path $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-1.5.0-SNAPSHOT.jar --master local[4] ~/Documents/pysparktest.py
pymongo_spark is available only in mongo-hadoop 1.5, so it won't work with mongo-hadoop 1.4. To make it available, you also have to add the directory with the Python package to the PYTHONPATH. If you've built the package yourself, it is located in spark/src/main/python/.
export PYTHONPATH=$PYTHONPATH:$MONGO_SPARK_SRC/src/main/python
where MONGO_SPARK_SRC is a directory with Spark Connector source.
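Alternatively, a minimal sketch of doing the same thing from inside the script rather than the shell (assuming the same MONGO_SPARK_SRC environment variable is set):
import os
import sys

# Make the connector's Python package importable before importing pymongo_spark.
sys.path.insert(0, os.path.join(os.environ["MONGO_SPARK_SRC"], "src", "main", "python"))

import pymongo_spark
pymongo_spark.activate()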
See also Getting Spark, Python, and MongoDB to work together

Unable to import gsutil

I feel I set everything up correctly. I followed these instructions and installed from the tar file.
My home directory has a folder "gsutil" now. I ran through the configuration to set my app up for oauth2, and am able to call gsutil from the command line. To use gsutil and Google App Engine, I added the following lines to the .bashrc file in my Home directory and sourced it:
export PATH=$PATH:$HOME/google_appengine
export PATH=${PATH}:$HOME/gsutil
export PYTHONPATH=${PYTHONPATH}:$HOME/gsutil/third_party/boto:$HOME/gsutil
However, when I try to import in my python script by either:
import gsutil
Or something like this (straight from the documentation).
from gslib.third_party.oauth2_plugin import oauth2_plugin
I get errors like:
ImportError: No module named gslib.third_party.oauth2_plugin
Did I miss a step somewhere? Thanks
EDIT:
Here is the output of (','.join(sys.path)):
import sys; print(', '.join(sys.path))
, /usr/local/lib/python2.7/dist-packages/setuptools-1.4.1-py2.7.egg, /usr/local/lib/python2.7/dist-packages/pip-1.4.1-py2.7.egg, /usr/local/lib/python2.7/dist-packages/gsutil-3.40-py2.7.egg, /home/[myname], /home/[myname]/gsutil/third_party/boto, /home/[myname]/gsutil, /usr/lib/python2.7, /usr/lib/python2.7/plat-linux2, /usr/lib/python2.7/lib-tk, /usr/lib/python2.7/lib-old, /usr/lib/python2.7/lib-dynload, /usr/local/lib/python2.7/dist-packages, /usr/lib/python2.7/dist-packages, /usr/lib/python2.7/dist-packages/PIL, /usr/lib/python2.7/dist-packages/gst-0.10, /usr/lib/python2.7/dist-packages/gtk-2.0, /usr/lib/python2.7/dist-packages/ubuntu-sso-client, /usr/lib/python2.7/dist-packages/ubuntuone-client, /usr/lib/python2.7/dist-packages/ubuntuone-control-panel, /usr/lib/python2.7/dist-packages/ubuntuone-couch, /usr/lib/python2.7/dist-packages/ubuntuone-installer, /usr/lib/python2.7/dist-packages/ubuntuone-storage-protocol
EDIT 2:
I can import the module from the command line, but I can't from within my Google App Engine app.
Here is the first line of the output using python -v
import gsutil
/home/adrian/gsutil/gsutil.pyc matches /home/adrian/gsutil/gsutil.py
But when I try to import it from an app, I get this message:
import gsutil
ImportError: No module named gsutil
gsutil is intended to be used only from the command line. If you want to interact with Cloud Storage from within an App Engine application, you should use the Cloud Storage client library: https://developers.google.com/appengine/docs/java/googlecloudstorageclient/
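For reference, a minimal sketch using the Python version of that client library from within an App Engine app (it assumes the GoogleAppEngineCloudStorageClient package is vendored into the app; the bucket and object names are placeholders):
import cloudstorage as gcs

def read_object(bucket_name, object_name):
    # GCS paths in this library are written as /bucket/object.
    filename = '/%s/%s' % (bucket_name, object_name)
    with gcs.open(filename) as f:
        return f.read()

# Example with placeholder names:
# data = read_object('my-bucket', 'some/file.txt')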

How to run code on the AWS cluster using Apache-Spark?

I've written Python code that sums up all the numbers in the first column of each csv file, as follows:
import os, sys, inspect, csv
### Current directory path.
curr_dir = os.path.split(inspect.getfile(inspect.currentframe()))[0]
### Setup the environment variables
spark_home_dir = os.path.realpath(os.path.abspath(os.path.join(curr_dir, "../spark")))
python_dir = os.path.realpath(os.path.abspath(os.path.join(spark_home_dir, "./python")))
os.environ["SPARK_HOME"] = spark_home_dir
os.environ["PYTHONPATH"] = python_dir
### Setup pyspark directory path
pyspark_dir = python_dir
sys.path.append(pyspark_dir)
### Import the pyspark
from pyspark import SparkConf, SparkContext
### Specify the data file directory, and load the data files
data_path = os.path.realpath(os.path.abspath(os.path.join(curr_dir, "./test_dir")))
### myfunc is to add all numbers in the first column.
def myfunc(s):
    total = 0
    if s.endswith(".csv"):
        cr = csv.reader(open(s, "rb"))
        for row in cr:
            total += int(row[0])
    return total

def main():
    ### Initialize the SparkConf and SparkContext
    conf = SparkConf().setAppName("ruofan").setMaster("spark://ec2-52-26-177-197.us-west-2.compute.amazonaws.com:7077")
    sc = SparkContext(conf = conf)
    datafile = sc.wholeTextFiles(data_path)
    ### Sent the application in each of the slave node
    temp = datafile.map(lambda (path, content): myfunc(str(path).strip('file:')))
    ### Collect the result and print it out.
    for x in temp.collect():
        print x

if __name__ == "__main__":
    main()
I would like to use Apache-Spark to parallelize the summation process for several csv files using the same python code. I've already done the following steps:
I've created one master and two slave nodes on AWS.
I've used the bash command $ scp -r -i my-key-pair.pem my_dir root@ec2-52-27-82-124.us-west-2.compute.amazonaws.com to upload the directory my_dir, including my python code and the csv files, onto the cluster's master node.
I've logged into my master node, and from there used the bash command $ ./spark/copy-dir my_dir to send my python code as well as the csv files to all slave nodes.
I've setup the environment variables on the master node:
$ export SPARK_HOME=~/spark
$ export PYTHONPATH=$SPARK_HOME/python/:$PYTHONPATH
However, when I run the python code on the master node: $ python sum.py, it shows the following error:
Traceback (most recent call last):
File "sum.py", line 18, in <module>
from pyspark import SparkConf, SparkContext
File "/root/spark/python/pyspark/__init__.py", line 41, in <module>
from pyspark.context import SparkContext
File "/root/spark/python/pyspark/context.py", line 31, in <module>
from pyspark.java_gateway import launch_gateway
File "/root/spark/python/pyspark/java_gateway.py", line 31, in <module>
from py4j.java_gateway import java_import, JavaGateway, GatewayClient
ImportError: No module named py4j.java_gateway
I have no idea about this error. Also, I am wondering whether the master node automatically calls all slave nodes to run in parallel. I would really appreciate it if anyone can help me.
Here is how I would debug this particular import error (a scripted version of the same check is sketched after these steps).
ssh to your master node
Run the python REPL with $ python
Try the failing import line >> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
If it fails, try simply running >> import py4j
If that fails, it means that your system either does not have py4j installed or cannot find it.
Exit the REPL >> exit()
Try installing py4j $ pip install py4j (you'll need to have pip installed)
Open the REPL $ python
Try importing again >> from py4j.java_gateway import java_import, JavaGateway, GatewayClient
If that works, then >> exit() and try running your $ python sum.py again
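A minimal sketch of that check as a one-off script, for convenience (it only reports whether py4j is importable and from where; nothing in it is specific to Spark):
try:
    import py4j
    print("py4j found at: " + py4j.__file__)
except ImportError as exc:
    print("py4j is missing: " + str(exc))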
I think you are asking two separate questions. It looks like you have an import error. Is it possible that you have a different version of the package py4j installed on your local computer that you haven't installed on your master node?
I can't help with running this in parallel.
