I am getting an error while defining H2OContext in a Python Spark script

Code:
from pyspark.sql import SparkSession
from pysparkling import *
hc = H2OContext.getOrCreate()
I am using a Spark standalone cluster (3.2.1) and trying to initialize H2OContext in a Python file. While trying to run the script using spark-submit, I get the following error:
hc = H2OContext.getOrCreate()
NameError: name 'H2OContext' is not defined
Spark-submit command:
spark-submit --master spark://local:7077 --packages
ai.h2o:sparkling-water-package_2.12:3.36.1.3-1-3.2 spark_h20/h2o.py

The parameter --packages ai.h2o:sparkling-water-package_2.12:3.36.1.3-1-3.2 downloads a jar artifact from Maven. This artifact can be used only from Scala/Java; I see there is a mistake in the Sparkling Water documentation here.
If you want to use the Python API, you need to:
Download the Sparkling Water zip archive from this location
Unzip the archive and go to the unzipped folder
Use the command spark-submit --master spark://local:7077 --py-files py/h2o_pysparkling_3.2-3.36.1.3-1-3.2.zip spark_h20/h2o.py to submit the script to the cluster.
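For reference, a minimal script that should work once the pysparkling zip is shipped via --py-files might look like this (the SparkSession setup here is an assumption; adapt it to your cluster):
from pyspark.sql import SparkSession
from pysparkling import H2OContext  # resolved from the zip passed via --py-files

# Create (or reuse) the SparkSession first, then attach H2O to it.
spark = SparkSession.builder \
    .appName("sparkling-water-example") \
    .getOrCreate()

hc = H2OContext.getOrCreate()
print(hc)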

Related

Use PySpark inside a Python project

I have a Python project and am trying to use PySpark within it. I built a Python class that calls PySpark classes and methods. I declare a SparkConf, create a configuration that is used by Spark, and then create a SparkSession with this conf. My Spark environment is a cluster, and I can use it in cluster deploy mode with yarn as the master. But when I try to get an instance of the Spark session, it shows up on the YARN applications page as submitted, yet my Python methods are not submitted as Spark jobs. It works if I submit this Python file as a single script; then it is run as a job by YARN.
Here is the sample code:
from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = SparkConf()
spark = SparkSession.builder.master('yarn') \
    .config(conf=conf) \
    .appName('myapp') \
    .getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1,2,3])
count = rdd.count()
print(sc.master)
print(count)
It works great when I submit it with ./bin/spark-submit myapp.py, and I see it running on YARN. It does not work as I expect when I run it with python myapp.py: it shows up on YARN as an application, but no job or executor is assigned.
Any help will be appreciated.
PS: I have already set the environment variables, including the Hadoop conf dir, Spark conf dir, etc., and the core-site and yarn-site XML config files, so I did not mention them here.

PySpark - Input path does not exist on YARN. Works fine locally

I am new to Spark and Python and I am trying to launch a Python script (through a bash run.sh command).
When I run it in local mode, everything is fine. When I try to run it on the cluster (which has Spark 2.1.2 without Hadoop), I receive the error below.
I hope this info is enough.
What should I do so that the script runs on YARN?
from pyspark import SparkContext, SparkConf
import sys
import collections
import os
from subprocess import call, Popen
import numpy
import re
import requests
import json
import math
from bs4 import BeautifulSoup
from bs4.element import Comment
sc = SparkContext("yarn", "test")
record_attribute = sys.argv[1]
in_file = sys.argv[2]
# Read the WARC file and split records on "WARC/1.0"
rdd = sc.newAPIHadoopFile(in_file,
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "WARC/1.0"})
And this is the error
17/11/30 14:05:48 INFO spark.SparkContext: Created broadcast 1 from broadcast at PythonRDD.scala:553
Traceback (most recent call last):
File "/home/test/script.py", line 51, in <module>
,conf={"textinputformat.record.delimiter": "WARC/1.0"})
File "/home/test/spark-2.1.2-bin-without-hadoop/python/lib/pyspark.zip/pyspark/context.py", line 651, in newAPIHadoopFile
File "/home/test/spark-2.1.2-bin-without-hadoop/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/home/test/spark-2.1.2-bin-without-hadoop/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopFile. org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist
"When I run it on local mode, everything is fine"
In local mode you are loading the file from the local filesystem. When you deploy to the cluster, you need to ensure that in_file exists on all executors (YARN NodeManagers), or on a shared filesystem such as HDFS or S3.
You should specify where your data is located in that bash script, e.g. hdfs:///some/input/in_file
If you did not copy your Hadoop cluster's core-site.xml and hdfs-site.xml into the directory pointed to by a local HADOOP_CONF_DIR environment variable, or otherwise configure the filesystem yourself, then the default behavior is to read the local filesystem, and you will therefore need to use a full external filesystem URI. For example, HDFS paths take the form hdfs://namenode:port/some/input/in_file
Note: You need to first upload your file to the remote filesystem
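A minimal sketch of what that looks like in the script, assuming the file has already been copied to HDFS (the namenode host, port, and path below are placeholders):
from pyspark import SparkContext

sc = SparkContext("yarn", "test")

# Placeholder HDFS URI; replace the namenode, port, and path with your cluster's values.
in_file = "hdfs://namenode:8020/some/input/in_file"

rdd = sc.newAPIHadoopFile(in_file,
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "WARC/1.0"})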
Did you configure HADOOP_CONF_DIR or YARN_CONF_DIR properly? This directory should contain the client configs of the HDFS and YARN services, so that the Spark application can get resources from YARN and perform read/write operations on HDFS.
Check the doc below, which covers the prerequisites for running Spark on YARN:
https://spark.apache.org/docs/2.1.1/running-on-yarn.html
If you deploy Spark from Cloudera Manager or an Ambari server, all environment variables associated with the client configuration are deployed automatically.
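If you would rather set these from the script than from the shell, something along these lines should work; note that the /etc/hadoop/conf path is an assumption, and this only helps when the JVM is launched from the Python process (e.g. python script.py), not when the driver is already started by spark-submit:
import os
from pyspark import SparkContext

# Assumed location of the Hadoop/YARN client configs; adjust for your cluster.
os.environ["HADOOP_CONF_DIR"] = "/etc/hadoop/conf"
os.environ["YARN_CONF_DIR"] = "/etc/hadoop/conf"

sc = SparkContext("yarn", "test")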
Running Python SparkPi in YARN Cluster Mode
Run the pi.py file:
spark-submit --master yarn --deploy-mode cluster SPARK_HOME/lib/pi.py 10
Please refer to the following link for more information: "Running Spark Applications on YARN".

import pymongo_spark doesn't work when executing with spark-submit

I am running into a problem running my script with spark-submit. The main script won't even run, because import pymongo_spark raises ImportError: No module named pymongo_spark.
I checked this thread and this thread to try to figure out the issue, but so far there's no result.
My setup:
$HADOOP_HOME is set to /usr/local/cellar/hadoop/2.7.1 where my hadoop files are
$SPARK_HOME is set to /usr/local/cellar/apache_spark/1.5.2
I also followed those threads and online guides as closely as possible to get:
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
export PATH=$PATH:$HADOOP_HOME/bin
PYTHONPATH=$SPARK_HOME/libexec/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
Then I used this piece of code from the first thread I linked to test:
from pyspark import SparkContext, SparkConf
import pymongo_spark

pymongo_spark.activate()

def main():
    conf = SparkConf().setAppName('pyspark test')
    sc = SparkContext(conf=conf)

if __name__ == '__main__':
    main()
Then in the terminal, I did:
$SPARK_HOME/bin/spark-submit --jars $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-r1.4.2-1.4.2.jar --driver-class-path $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-r1.4.2-1.4.2.jar --master local[4] ~/Documents/pysparktest.py
Where mongo-hadoop-r1.4.2-1.4.2.jar is the jar I built following this guide
I'm definitely missing something, but I'm not sure what. I'm running everything locally on Mac OS X El Capitan. I'm almost sure this doesn't matter, but I wanted to add it anyway.
EDIT:
I also tried another jar file, mongo-hadoop-1.5.0-SNAPSHOT.jar; the same problem remains.
My command:
$SPARK_HOME/bin/spark-submit --jars $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-1.5.0-SNAPSHOT.jar --driver-class-path $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-1.5.0-SNAPSHOT.jar --master local[4] ~/Documents/pysparktest.py
pymongo_spark is available only in mongo-hadoop 1.5, so it won't work with mongo-hadoop 1.4. To make it importable you also have to add the directory containing the Python package to PYTHONPATH. If you've built the package yourself, it is located in spark/src/main/python/.
export PYTHONPATH=$PYTHONPATH:$MONGO_SPARK_SRC/src/main/python
where MONGO_SPARK_SRC is the directory containing the Spark connector source.
See also Getting Spark, Python, and MongoDB to work together
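Once the package is on PYTHONPATH and the mongo-hadoop 1.5 jars are passed with --jars, a minimal read could look like the sketch below (the MongoDB URI, database, and collection are placeholders):
from pyspark import SparkContext, SparkConf
import pymongo_spark

# Patches SparkContext with MongoDB helpers such as mongoRDD().
pymongo_spark.activate()

conf = SparkConf().setAppName('pyspark test')
sc = SparkContext(conf=conf)

# Placeholder URI; replace the host, database, and collection with your own.
rdd = sc.mongoRDD('mongodb://localhost:27017/mydb.mycollection')
print(rdd.count())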

Include package in Spark local mode

I'm writing some unit tests for my Spark code in python. My code depends on spark-csv. In production I use spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 to submit my python script.
I'm using pytest to run my tests with Spark in local mode:
conf = SparkConf().setAppName('myapp').setMaster('local[1]')
sc = SparkContext(conf=conf)
My question is, since pytest isn't using spark-submit to run my code, how can I provide my spark-csv dependency to the python process?
You can use the spark.driver.extraClassPath property in your config file to sort out the problem.
Edit spark-defaults.conf and add the property:
spark.driver.extraClassPath /Volumes/work/bigdata/CHD5.4/spark-1.4.0-bin-hadoop2.6/lib/spark-csv_2.11-1.1.0.jar:/Volumes/work/bigdata/CHD5.4/spark-1.4.0-bin-hadoop2.6/lib/commons-csv-1.1.jar
After setting the above, you don't even need the --packages flag when running from the shell.
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load(BASE_DATA_PATH + '/ssi.csv')
Both jars are important, as spark-csv depends on the Apache commons-csv jar. You can either build the spark-csv jar yourself or download it from the Maven repository.
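Since the question is specifically about pytest (where nothing goes through spark-submit on the command line), another commonly used option, not described above, is to set PYSPARK_SUBMIT_ARGS before the SparkContext is created; a sketch using the package coordinates from the question:
import os

# Must be set before pyspark launches the JVM gateway, i.e. before SparkContext is created.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.databricks:spark-csv_2.10:1.0.3 pyspark-shell'
)

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('myapp').setMaster('local[1]')
sc = SparkContext(conf=conf)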

Read a folder of parquet files from s3 location using pyspark to pyspark dataframe

I want to read some parquet files in the folder poc/folderName of the S3 bucket myBucketName into a PySpark dataframe. I am using PySpark v2.4.3.
Below is the code I am using:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", 'id')
sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", 'sid')
sqlContext = SQLContext(sc)
parquetDF = sqlContext.read.parquet("s3a://myBucketName/poc/folderName")
I have downloaded the hadoop-aws package using the command pyspark --packages org.apache.hadoop:hadoop-aws:3.3.0, but when I run the above code I receive the error below.
An error occurred while calling o825.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
What am I doing wrong here?
I am running the Python code using Anaconda and Spyder on Windows 10.
The Maven coordinates for the open source Hadoop S3 driver need to be added as a package dependency:
spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.0
Note the above package version is tied to the installed AWS SDK for Java version.
In the Spark application's code, something like the following may also be needed:
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)
Note that when using the open source Hadoop driver, the S3 URI scheme is s3a, not s3 (the scheme used by Spark on EMR with Amazon's proprietary EMRFS), e.g. s3a://bucket-name/
Credits to danielchalef
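Putting the pieces together for the question's case, a minimal sketch (the access keys and bucket path are placeholders, and the hadoop-aws version must match the installed Hadoop):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3a-parquet-read").getOrCreate()

hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")  # placeholder
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")  # placeholder

# Note the s3a:// scheme rather than s3://.
parquetDF = spark.read.parquet("s3a://myBucketName/poc/folderName")
parquetDF.show()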
