Execute python script with spark - python

I want to pass a Python script to the SparkContext within my Jupyter notebook and have the output shown in the notebook as well. To test this, I'm simply executing the following in my Jupyter notebook:
from pyspark import SparkConf, SparkContext

sparkConf = SparkConf()
sc = SparkContext(conf=sparkConf)
sc.addPyFile('test.py')
With test.py looking like
rdd = sc.parallelize(range(100000000))
print(rdd.sum())
But when I execute the sc.addPyFile line in my notebook, I do not see the output. Am I passing the pyspark script into my SparkContext incorrectly?

The function you are using does not trigger a job; instead, it passes the Python module to the SparkContext so that it can be imported in your scripts as needed.
See here:
https://spark.apache.org/docs/0.7.3/api/pyspark/pyspark.context.SparkContext-class.html#addPyFile
To trigger a job, you need to run
spark-submit test.py
outside of your Jupyter notebook.
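For illustration, here is a minimal sketch of how addPyFile is typically used (helpers.py and its double function are hypothetical names): the file is shipped to the executors so it can be imported there, while the job itself is triggered by the driver code in the notebook cell or via spark-submit.

# helpers.py -- a hypothetical helper module to be shipped to the executors
def double(x):
    return 2 * x

# notebook cell / driver script
from pyspark import SparkConf, SparkContext

sc = SparkContext(conf=SparkConf())
sc.addPyFile('helpers.py')      # distributes the module; it does not run it

import helpers                  # now importable on the driver and in tasks
rdd = sc.parallelize(range(10))
print(rdd.map(helpers.double).collect())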

Related

Use PySpark inside a Python project

I have a Python project and am trying to use PySpark within it. I build a Python class that calls PySpark classes and methods. I declare a SparkConf, create a configuration that is used by Spark, and then create a SparkSession with this conf. My Spark environment is a cluster, which I can use in cluster deploy mode with YARN as the master. But when I try to get an instance of the Spark session, it can be seen on the YARN applications page as submitted, yet my Python methods are not submitted as a Spark job. If I submit the same Python file as a single script, it runs as a job on YARN.
Here is the sample code:
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
spark = SparkSession.builder.master('yarn') \
    .config(conf=conf) \
    .appName('myapp') \
    .getOrCreate()

sc = spark.sparkContext
rdd = sc.parallelize([1, 2, 3])
count = rdd.count()
print(sc.master)
print(count)
It works great when I submit it with ./bin/spark-submit myapp.py and I see it running on YARN. It does not work as I expect when I run it with python myapp.py: I can see it on YARN as an application, but with no jobs or executors assigned.
Any help will be appreciated.
PS: I have already set the environment variables (including the Hadoop conf dir and Spark conf dir) and the core-site.xml and yarn-site.xml configuration files, so I have not listed them here.

Pytest does not output junitxml when running in Databricks repo

We have a Databricks platform where repos and files in repos are enabled. As such, we can have .py files within the repos which can be called by Databricks notebooks.
We are currently testing the viability of running our unit tests on Databricks clusters instead of using a (PySpark) image in our Git / CI environment.
The repo within Databricks looks like
| - notebook
| - mycode.py
| - mycode_test.py
Here, mycode.py contains a function that applies a transformation to a Spark DataFrame. The file mycode_test.py contains a unit test built with pytest (and some fixtures to create test data and handle the Spark session / Spark context).
We run pytest from the notebook, instead of from the command line. Hence, the Databricks notebook looks like:
import pytest

retcode = pytest.main(['-k', 'mycode_test',
                       '-o', 'cache_dir=/dbfs/FileStore/',
                       '--junitxml', '/dbfs/FileStore/pytestreport.xml',
                       '-v'])
This code snippet runs fine on a standard Databricks cluster (with runtime 10.4 LTS and pytest installed) and the results of the unit testing are printed out below the cell.
However, no output is stored at the cache directory or at the path given for the junit XML file.
Questions:
Are we missing something here?
Can we assume that it actually generated output at an unknown location because the pytest.main did not crash?
Are the .fuse-mounts within Databricks causing the issue here?
It turned out that I had made some mistakes in my initial setup of the paths in the pytest.main command. I have updated these paths and they now work.
Thus, the snippet below generates the XML and caching files in the Databricks FileStore.
Again, this probably only works when you are working within a Databricks Repo with files in repos enabled.
import pytest

retcode = pytest.main(['-k', 'mycode_test',
                       '-o', 'cache_dir=/dbfs/FileStore/',
                       '--junitxml', '/dbfs/FileStore/pytestreport.xml',
                       '-v'])
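As a quick sanity check (a sketch, assuming a standard cluster where dbfs:/FileStore is also reachable through the /dbfs FUSE mount and dbutils is available in the notebook), you can list the FileStore to confirm the report was written:

import os

# the FUSE path /dbfs/FileStore maps to dbfs:/FileStore on standard clusters
print(os.listdir('/dbfs/FileStore/'))

# or use the Databricks utilities object that is available in notebooks
display(dbutils.fs.ls('dbfs:/FileStore/'))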

How to import one databricks notebook into another?

I have a Python notebook A in Azure Databricks with an import statement as below:
import xyz, datetime, ...
Here xyz is another notebook that is being imported into notebook A, as shown in the code above.
When I run notebook A, it throws the following error:
ImportError: No module named xyz
Both notebooks are in the same workspace directory. Can anyone help in resolving this?
The only way to import notebooks is by using the %run command:
%run /Shared/MyNotebook
or relative path:
%run ./MyNotebook
More details: https://docs.azuredatabricks.net/user-guide/notebooks/notebook-workflows.html
To get a result back as a DataFrame from a different notebook in Databricks, we can do the following.
notebook1:
def func1(arg):
    df = arg.transformationlogic   # placeholder for your transformation
    return df
notebook2:
%run path-of-notebook1
df = func1(dfinput)
Here dfinput is the DataFrame you are passing in, and you will get the transformed DataFrame back from func1.

Unable to connect to Mongo from pyspark

I'm trying to connect to MongoDB using pyspark. Below is the code I'm using
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sparkConf = SparkConf().setAppName("App")
sparkConf.set("spark.mongodb.input.uri", "mongodb://127.0.0.1/mydb.test")
sc = SparkContext(conf = sparkConf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
I'm facing the below error
py4j.protocol.Py4JJavaError: An error occurred while calling o25.load.
java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource.
Failed to find data source: com.mongodb.spark.sql.DefaultSource.
This error indicates that PySpark failed to locate the MongoDB Spark Connector.
If you're invoking pyspark directly, make sure you specify mongo-spark-connector in the packages parameter. For example:
./bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
If you're not invoking pyspark directly (e.g. from an IDE such as Eclipse), you will have to modify the Spark configuration property spark.jars.packages to specify the dependency.
Either within the spark-defaults.conf file:
spark.jars.packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
Or, you can try changing the configuration within the code:
SparkConf().set("spark.jars.packages","org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
Or:
SparkSession.builder.config('spark.jars.packages','org.mongodb.spark:mongo-spark-connector_2.11:2.2.0' ).getOrCreate()
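Putting this together, a minimal sketch (assuming Spark 2.x built for Scala 2.11, a fresh Python process so the package setting takes effect, and a local MongoDB instance; mydb.test is the placeholder database/collection from the question) would look like:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("App")
         .config("spark.jars.packages",
                 "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
         .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/mydb.test")
         .getOrCreate())

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()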
I had this problem. I could insert and find documents from the pyspark shell, but I could not when using this code snippet within the pyspark runtime:
SparkSession.builder.config('spark.jars.packages','org.mongodb.spark:mongo-spark-connector_2.11:2.2.0' ).getOrCreate()
You may also need to add the packages manually to $SPARK_HOME/jars. Consider this answer.

import pymongo_spark doesn't work when executing with spark-submit

I am running into a problem running my script with spark-submit. The main script won't even run because import pymongo_spark returns ImportError: No module named pymongo_spark.
I checked this thread and this thread to try to figure out the issue, but so far there's no result.
My setup:
$HADOOP_HOME is set to /usr/local/cellar/hadoop/2.7.1 where my hadoop files are
$SPARK_HOME is set to /usr/local/cellar/apache_spark/1.5.2
I also followed those threads and guides online as closely as possible to get:
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
export PATH=$PATH:$HADOOP_HOME/bin
PYTHONPATH=$SPARK_HOME/libexec/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
Then I used this piece of code from the first thread I linked to test:
from pyspark import SparkContext, SparkConf
import pymongo_spark

pymongo_spark.activate()

def main():
    conf = SparkConf().setAppName('pyspark test')
    sc = SparkContext(conf=conf)

if __name__ == '__main__':
    main()
Then in the terminal, I did:
$SPARK_HOME/bin/spark-submit --jars $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-r1.4.2-1.4.2.jar --driver-class-path $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-r1.4.2-1.4.2.jar --master local[4] ~/Documents/pysparktest.py
Where mongo-hadoop-r1.4.2-1.4.2.jar is the jar I built following this guide
I'm definitely missing things, but I'm not sure where or what I'm missing. I'm running everything locally on Mac OS X El Capitan. I'm almost sure this doesn't matter, but I just wanted to add it.
EDIT:
I also tried another jar file, mongo-hadoop-1.5.0-SNAPSHOT.jar, but the same problem remains.
My command:
$SPARK_HOME/bin/spark-submit --jars $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-1.5.0-SNAPSHOT.jar --driver-class-path $HADOOP_HOME/libexec/share/hadoop/mapreduce/mongo-hadoop-1.5.0-SNAPSHOT.jar --master local[4] ~/Documents/pysparktest.py
pymongo_spark is available only in mongo-hadoop 1.5, so it won't work with mongo-hadoop 1.4. To make it available, you have to add the directory containing the Python package to PYTHONPATH as well. If you've built the package yourself, it is located in spark/src/main/python/.
export PYTHONPATH=$PYTHONPATH:$MONGO_SPARK_SRC/src/main/python
where MONGO_SPARK_SRC is the directory containing the Spark connector source.
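Once that directory is on PYTHONPATH and the mongo-hadoop jars are passed via --jars as above, the driver script would look roughly like the sketch below (a sketch only: it assumes mongo-hadoop 1.5's pymongo_spark, whose activate() adds mongoRDD to SparkContext, and the database/collection name is a placeholder):

from pyspark import SparkContext, SparkConf
import pymongo_spark

# activate() monkey-patches SparkContext with the MongoDB helpers
pymongo_spark.activate()

def main():
    conf = SparkConf().setAppName('pyspark test')
    sc = SparkContext(conf=conf)
    rdd = sc.mongoRDD('mongodb://localhost:27017/mydb.mycollection')
    print(rdd.first())

if __name__ == '__main__':
    main()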
See also Getting Spark, Python, and MongoDB to work together
