Include package in Spark local mode - python

I'm writing some unit tests for my Spark code in python. My code depends on spark-csv. In production I use spark-submit --packages com.databricks:spark-csv_2.10:1.0.3 to submit my python script.
I'm using pytest to run my tests with Spark in local mode:
conf = SparkConf().setAppName('myapp').setMaster('local[1]')
sc = SparkContext(conf=conf)
My question is, since pytest isn't using spark-submit to run my code, how can I provide my spark-csv dependency to the python process?

You can sort this out by setting spark.driver.extraClassPath in your config file, spark-defaults.conf. Add the property:
spark.driver.extraClassPath /Volumes/work/bigdata/CHD5.4/spark-1.4.0-bin-hadoop2.6/lib/spark-csv_2.11-1.1.0.jar:/Volumes/work/bigdata/CHD5.4/spark-1.4.0-bin-hadoop2.6/lib/commons-csv-1.1.jar
After setting the above, you don't even need the --packages flag when running from the shell.
sqlContext = SQLContext(sc)
df = sqlContext.read.format('com.databricks.spark.csv').options(header='false').load(BASE_DATA_PATH + '/ssi.csv')
Both jars are needed, as spark-csv depends on the Apache commons-csv jar. You can either build the spark-csv jar yourself or download it from the Maven repository.
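Not part of the answer above, but another way to handle the pytest case specifically: since nothing invokes spark-submit for you, PySpark picks up extra submit options from the PYSPARK_SUBMIT_ARGS environment variable. A minimal sketch, reusing the question's package coordinates (the trailing pyspark-shell token is required):
import os
from pyspark import SparkConf, SparkContext

# Must be set before the SparkContext (and hence the JVM) is created,
# e.g. in conftest.py or at the top of the test module.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.databricks:spark-csv_2.10:1.0.3 pyspark-shell'
)

conf = SparkConf().setAppName('myapp').setMaster('local[1]')
sc = SparkContext(conf=conf)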

Related

Use Pyspark inside the python project

I have a Python project and am trying to use PySpark within it. I build a Python class that calls PySpark classes and methods. I declare a SparkConf, create a configuration which is used by Spark, then create a SparkSession with this conf. My Spark environment is a cluster, so I can use cluster deploy mode with the master set to yarn. But when I try to get an instance of the Spark session, it appears on the Yarn applications page as submitted, yet it does not submit my Python methods as Spark jobs. It works if I submit this Python file as a single script; then it runs as a job on Yarn.
Here is the sample code:
from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = SparkConf()
spark = SparkSession.builder.master('yarn') \
    .config(conf=conf) \
    .appName('myapp') \
    .getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1,2,3])
count = rdd.count()
print(sc.master)
print(count)
It works great when I submit it with ./bin/spark-submit myapp.py, and I can see it running on Yarn. It does not work as I expect when I run it with python myapp.py: it shows up on Yarn as an application, but with no job or executor assigned.
Any help will be appreciated.
PS: I have already set the environment variables, including the Hadoop conf dir, Spark conf dir, etc., and the core-site and yarn-site XML conf files, so I have not listed them here.
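Not part of the original question, but for context: when the driver is started from a plain python process, only client deploy mode is possible; cluster deploy mode requires spark-submit. A minimal sketch, assuming HADOOP_CONF_DIR / YARN_CONF_DIR are already exported as noted in the PS:
from pyspark.sql import SparkSession

# The driver runs inside this python process, so the deploy mode is effectively 'client';
# 'cluster' mode is only available through spark-submit.
spark = (SparkSession.builder
         .master('yarn')
         .config('spark.submit.deployMode', 'client')
         .appName('myapp')
         .getOrCreate())

sc = spark.sparkContext
print(sc.master)                          # yarn
print(sc.parallelize([1, 2, 3]).count())  # 3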

I am getting an error while defining H2OContext in a Python Spark script

Code:
from pyspark.sql import SparkSession
from pysparkling import *
hc = H2OContext.getOrCreate()
I am using a Spark 3.2.1 standalone cluster and trying to initialize an H2OContext in a Python file. While trying to run the script using spark-submit, I am getting the following error:
hc = H2OContext.getOrCreate()
NameError: name 'H2OContext' is not defined
Spark-submit command:
spark-submit --master spark://local:7077 --packages
ai.h2o:sparkling-water-package_2.12:3.36.1.3-1-3.2 spark_h20/h2o.py
The parameter --packages ai.h2o:sparkling-water-package_2.12:3.36.1.3-1-3.2 downloads a jar artifact from Maven. This artifact can be used only from Scala/Java; I see there is a mistake in the Sparkling Water documentation.
If you want to use the Python API, you need to:
Download SW zip archive from this location
Unzip the archive and go to the unzipped folder
Use the command spark-submit --master spark://local:7077 --py-files py/h2o_pysparkling_3.2-3.36.1.3-1-3.2.zip spark_h20/h2o.py for submitting the script to the cluster.
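For completeness, a minimal sketch of what the submitted script might look like once the pysparkling zip is shipped via --py-files (the names here are assumptions, not part of the original answer):
from pyspark.sql import SparkSession
from pysparkling import H2OContext  # resolvable once the zip from --py-files is on the Python path

spark = SparkSession.builder.appName('h2o-example').getOrCreate()

# Starts (or attaches to) the H2O cluster running inside the Spark executors.
hc = H2OContext.getOrCreate()
print(hc)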

Pytest does not output junitxml when running in Databricks repo

We have a Databricks platform where repos and files in repos are enabled. As such, we can have .py files within the repos which can be called by Databricks notebooks.
We are currently testing the viability of running our unit tests on Databricks clusters instead of using a (PySpark) image in our Git / CI environment.
The repo within Databricks looks like
| - notebook
| - mycode.py
| - mycode_test.py
Here, mycode.py contains a function that applies a transformation on a Spark Dataframe. The file mycode_test.py contains a unit test built with pytest (and some fixtures to create test data and handle the Spark session / Spark context).
We run pytest from the notebook, instead of from the command line. Hence, the Databricks notebook looks like:
import pytest
retcode = pytest.main(['-k', 'mycode_test',
                       '-o', 'cache_dir=/dbfs/FileStore/',
                       '--junitxml', '/dbfs/FileStore/pytestreport.xml',
                       '-v'])
This code snippet runs fine on a standard Databricks cluster (with runtime 10.4 LTS and pytest installed) and the results of the unit testing are printed out below the cell.
However, no output is stored at the cache directory or at the path given for the junit XML file.
Questions:
Are we missing something here?
Can we assume that it actually generated output at an unknown location because the pytest.main did not crash?
Are the .fuse-mounts within Databricks causing the issue here?
It seems that I made some mistakes in my initial setup of the paths in the pytest.main command. I have updated these paths now and they work.
Thus, the snippet below generates the XML and caching files in the Databricks FileStore.
Again, this probably only works when you are working within a Databricks Repo with files in repos enabled.
import pytest
retcode = pytest.main(['-k', 'mycode_test',
                       '-o', 'cache_dir=/dbfs/FileStore/',
                       '--junitxml', '/dbfs/FileStore/pytestreport.xml',
                       '-v'])
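Not part of the original answer: one quick way to confirm the report actually landed is to read it back through the /dbfs FUSE mount from another notebook cell, for example:
# Read the generated junit XML back via the /dbfs FUSE path used above.
with open('/dbfs/FileStore/pytestreport.xml') as f:
    print(f.read()[:500])  # print the first few hundred characters as a sanity check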

import pymongo_spark doesn't work when executing with spark-submit

I am running into a problem running my script with spark-submit. The main script won't even run, because import pymongo_spark raises ImportError: No module named pymongo_spark
I checked this thread and this thread to try to figure out the issue, but so far there's no result.
My setup:
$HADOOP_HOME is set to /usr/local/cellar/hadoop/2.7.1 where my hadoop files are
$SPARK_HOME is set to /usr/local/cellar/apache_spark/1.5.2
I also followed those threads and guides online as closely as possible to get:
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
export PATH=$PATH:$HADOOP_HOME/bin
PYTHONPATH=$SPARK_HOME/libexec/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
Then, to test, I used this piece of code from the first thread I linked:
from pyspark import SparkContext, SparkConf
import pymongo_spark
pymongo_spark.activate()
def main():
    conf = SparkConf().setAppName('pyspark test')
    sc = SparkContext(conf=conf)

if __name__ == '__main__':
    main()
Then in the terminal, I did:
$SPARK_HOME/bin/spark-submit --jars $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-r1.4.2-1.4.2.jar --driver-class-path $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-r1.4.2-1.4.2.jar --master local[4] ~/Documents/pysparktest.py
Where mongo-hadoop-r1.4.2-1.4.2.jar is the jar I built following this guide
I'm definitely missing things, but I'm not sure where/what I'm missing. I'm running everything locally on Mac OSX El Capitan. Almost sure this doesn't matter, but just wanna add it in.
EDIT:
I also tried another jar file, mongo-hadoop-1.5.0-SNAPSHOT.jar; the same problem remains.
My command:
$SPARK_HOME/bin/spark-submit --jars $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-1.5.0-SNAPSHOT.jar --driver-class-path $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-1.5.0-SNAPSHOT.jar --master local[4] ~/Documents/pysparktest.py
pymongo_spark is available only in mongo-hadoop 1.5, so it won't work with mongo-hadoop 1.4. To make it available, you also have to add the directory containing the Python package to the PYTHONPATH. If you've built the package yourself, it is located in spark/src/main/python/.
export PYTHONPATH=$PYTHONPATH:$MONGO_SPARK_SRC/src/main/python
where MONGO_SPARK_SRC is the directory containing the Spark connector source.
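As a hedged usage sketch (not from the original question or answer): once the import succeeds, pymongo_spark.activate() patches SparkContext with Mongo helpers such as mongoRDD; the connection string below is a placeholder.
from pyspark import SparkConf, SparkContext
import pymongo_spark

pymongo_spark.activate()  # adds mongoRDD/saveToMongoDB helpers to SparkContext and RDD

def main():
    conf = SparkConf().setAppName('pyspark test')
    sc = SparkContext(conf=conf)
    # Placeholder MongoDB URI; replace with your own host/db/collection.
    rdd = sc.mongoRDD('mongodb://localhost:27017/mydb.mycollection')
    print(rdd.first())

if __name__ == '__main__':
    main()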
See also Getting Spark, Python, and MongoDB to work together

Read a folder of parquet files from s3 location using pyspark to pyspark dataframe

I want to read some parquet files from the folder poc/folderName in the S3 bucket myBucketName into a PySpark dataframe. I am using pyspark v2.4.3.
Below is the code I am using:
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext.getOrCreate()
sc._jsc.hadoopConfiguration().set("fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
sc._jsc.hadoopConfiguration().set("fs.s3a.awsAccessKeyId", 'id')
sc._jsc.hadoopConfiguration().set("fs.s3a.awsSecretAccessKey", 'sid')
sqlContext = SQLContext(sc)
parquetDF = sqlContext.read.parquet("s3a://myBucketName/poc/folderName")
I have downloaded the hadoop-aws package using the command pyspark --packages org.apache.hadoop:hadoop-aws:3.3.0, but when I run the above code I receive the error below.
An error occurred while calling o825.parquet.
: java.lang.RuntimeException: java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found
at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2654)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2667)
What am I doing wrong here?
I am running the Python code using Anaconda and Spyder on Windows 10.
The Maven coordinates for the open source Hadoop S3 driver need to be added as a package dependency:
spark-submit --packages org.apache.hadoop:hadoop-aws:2.7.0
Note the above package version is tied to the installed AWS SDK for Java version.
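An aside, not from the answer: the hadoop-aws version should match the Hadoop version bundled with your Spark build (pyspark 2.4.3 typically ships with Hadoop 2.7.x, so the hadoop-aws:3.3.0 used in the question is a likely mismatch). One way to check the bundled Hadoop version from PySpark:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Ask the bundled Hadoop libraries for their version via the JVM gateway.
hadoop_version = spark.sparkContext._jvm.org.apache.hadoop.util.VersionInfo.getVersion()
print(hadoop_version)  # then use org.apache.hadoop:hadoop-aws:<this version>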
In the Spark application's code, something like the following may also be needed:
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", access_id)
hadoop_conf.set("fs.s3a.secret.key", access_key)
Note that when using the open source Hadoop driver, the S3 URI scheme is s3a and not s3 (as it is when using Spark on EMR and Amazon's proprietary EMRFS). e.g. s3a://bucket-name/
Credits to danielchalef
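Not part of the credited answer: a minimal end-to-end sketch that combines the configuration above with a read using the s3a scheme (bucket and prefix names are taken from the question; the credential strings are placeholders):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder credentials; in practice prefer an AWS credentials provider over hard-coding.
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
hadoop_conf.set("fs.s3a.access.key", "ACCESS_KEY_ID")
hadoop_conf.set("fs.s3a.secret.key", "SECRET_ACCESS_KEY")

# Note the s3a scheme; bucket and prefix names follow the question.
df = spark.read.parquet("s3a://myBucketName/poc/folderName")
df.show(5)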
