I have my Spark configuration in spark-defaults.conf, in the XML files core-site.xml and hive-site.xml, and I have exported the relevant environment variables. When I run the pyspark console:
$ pyspark --master yarn
and then:
>>> sqlContext.sql("show tables").show()
everything works correctly, but when I use the pure Python interpreter I cannot see my tables:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
conf = SparkConf().setMaster("yarn-client")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
sqlContext.sql("show tables").show()
How can I make python see all config files?
My understanding is that when you run the PySpark shell, Spark is instantiated with Hive support, i.e. the sqlContext it gives you is actually a HiveContext (a subclass of SQLContext).
But when you run a Python program or the plain Python interpreter, as in your case, the SQLContext you create yourself does not come with Hive support.
To fix this, create a HiveContext instead:
from pyspark.sql import HiveContext
sqlCtx = HiveContext(sc)
sqlCtx.sql("show tables").show()
Related
I have a Python project and am trying to use PySpark within it. I build a Python class that calls PySpark classes and methods. I declare a SparkConf, create a configuration that is used by Spark, and then create a SparkSession with this conf. My Spark environment is a cluster, and I can use it in cluster deploy mode with YARN as the master. But when I try to get an instance of the Spark session, it shows up on the YARN applications page as submitted, yet my Python methods are never submitted as Spark jobs. It works if I submit the same Python file as a single script; then it runs as a job on YARN.
Here is the sample code:
from pyspark import SparkConf
from pyspark.sql import SparkSession
conf = SparkConf()
spark = SparkSession.builder.master('yarn')\
.config(conf=conf)\
.appName('myapp')\
.getOrCreate()
sc = spark.sparkContext
rdd = sc.parallelize([1,2,3])
count = rdd.count()
print(sc.master)
print(count)
It works great when I submit it with ./bin/spark-submit myapp.py, and I can see it running on YARN. It does not work as I expect when I run it with python myapp.py: it appears on YARN as an application, but with no job or executor assigned.
Any help will be appreciated.
PS: I have already set the environment variables, including the Hadoop conf dir, Spark conf dir, etc., and the core-site and yarn-site XML config files, so I did not mention them above.
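Not a definitive fix, but one debugging sketch worth trying: confirm that the plain Python interpreter sees the same environment as spark-submit, and that PySpark's launcher arguments request YARN client mode (cluster deploy mode cannot be used when the driver runs in your local Python process). The values below are assumptions; adjust them to your cluster:
import os

# These must be visible to the plain interpreter, not only to spark-submit
for var in ("SPARK_HOME", "HADOOP_CONF_DIR", "YARN_CONF_DIR"):
    print(var, "=", os.environ.get(var))

# PySpark reads PYSPARK_SUBMIT_ARGS when it launches the JVM from plain Python
os.environ.setdefault(
    "PYSPARK_SUBMIT_ARGS",
    "--master yarn --deploy-mode client pyspark-shell",
)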
I'm trying to install the Isolation Forest package on the Databricks platform. The version of Spark on Databricks is 3.1.1:
import pyspark
print(pyspark.__version__)
# 3.1.1
So I tried to follow this article to implement IsolationForest, but I couldn't install the package from this repo with the following steps:
Step 1. Package spark-iforest jar and deploy it into spark lib
cd spark-iforest/
mvn clean package -DskipTests
cp target/spark-iforest-.jar $SPARK_HOME/jars/
Step 2. Package pyspark-iforest and install it via pip; skip this step if you don't need the Python package
cd spark-iforest/python
python setup.py sdist
pip install dist/pyspark-iforest-.tar.gz
So basically I run the following script and get: ModuleNotFoundError: No module named 'pyspark_iforest'
from pyspark import SparkConf
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark_iforest.ml.iforest import IForest, IForestModel
import tempfile
conf = SparkConf()
conf.set('spark.jars', '/full/path/to/spark-iforest-2.4.0.jar')
spark = SparkSession \
.builder \
.config(conf=conf) \
.appName("IForestExample") \
.getOrCreate()
temp_path = tempfile.mkdtemp()
iforest_path = temp_path + "/iforest"
model_path = temp_path + "/iforest_model"
What is the best practice to install IsolationForest on the Databricks platform for PySpark?
This specific version of isolation forest is compiled for Spark 2.4 and Scala 2.11, and is binary incompatible with the Spark 3.1 that you're using. You may try to use a Databricks Runtime (DBR) version that is based on Spark 2.4, i.e. DBR 6.4 or 5.4.
You may also look at the mmlspark (Microsoft Machine Learning for Apache Spark) library developed by Microsoft; it has an implementation of IsolationForest, although I haven't used it myself.
You need to load the jar and create the Spark session first, before importing and using the Python module:
from pyspark import SparkConf
from pyspark.sql import SparkSession, functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler
import tempfile
conf = SparkConf()
conf.set('spark.jars', '/full/path/to/spark-iforest-2.4.0.jar')
spark = SparkSession \
.builder \
.config(conf=conf) \
.appName("IForestExample") \
.getOrCreate()
from pyspark_iforest.ml.iforest import IForest, IForestModel
temp_path = tempfile.mkdtemp()
iforest_path = temp_path + "/iforest"
model_path = temp_path + "/iforest_model"
I usually create the Spark session in a separate .py file, or provide spark.jars via the spark-submit command, because the way jars are loaded sometimes gives me trouble when they are added only within the code:
spark-submit --jars /full/path/to/spark-iforest-2.4.0.jar my_code.py
Also, there is a version mismatch, as @Alex Ott mentioned, but the error would be different in that case. Building IForest for PySpark 3.x is not very difficult, but if you don't want to get into that, you could downgrade the PySpark version.
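On Databricks specifically, rather than copying files into $SPARK_HOME/jars, a common pattern (assuming DBR 7.1+ and that you uploaded the built artifacts to DBFS; the path below is hypothetical) is to attach the Scala jar as a cluster library and install the Python package with a notebook-scoped pip command:
# In a notebook cell, after attaching the spark-iforest jar via
# Cluster -> Libraries -> Install new -> JAR (or spark.jars in the cluster config):
%pip install /dbfs/FileStore/jars/pyspark-iforest.tar.gz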
I'm trying to use MLflow to do the machine learning work. I register the ML model as a UDF using the following Python code. The question is: how can I use the UDF (test_predict) in my Scala code? The reason is that our main code is in Scala. The problem is that the UDF created below is a temporary UDF, scoped to the SparkSession. Thanks!
import sys
import mlflow
from mlflow import pyfunc
import numpy as np
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.sql.session import SparkSession
from pyspark.sql.types import *
sc = SparkContext()
spark = SparkSession.builder.appName("Python UDF example").getOrCreate()
pyfunc_udf = mlflow.pyfunc.spark_udf(spark=spark, model_uri="./sk", result_type="float")
spark.udf.register("test_predict", pyfunc_udf)
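For what it's worth, a UDF registered this way is available to Spark SQL within the same SparkSession, regardless of which language issues the query; a minimal sketch in Python (the table name events and the column feature_col are assumptions), which corresponds to the same spark.sql call Scala code sharing the session could make:
# Call the registered UDF through SQL; Scala code on the same SparkSession
# can run the identical query via spark.sql(...)
result_df = spark.sql("SELECT test_predict(feature_col) AS prediction FROM events")
result_df.show()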
I'm trying to connect to MongoDB using pyspark. Below is the code I'm using
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
sparkConf = SparkConf().setAppName("App")
sparkConf.set("spark.mongodb.input.uri", "mongodb://127.0.0.1/mydb.test")
sc = SparkContext(conf=sparkConf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
I'm facing the below error
py4j.protocol.Py4JJavaError: An error occurred while calling o25.load.
java.lang.ClassNotFoundException: Failed to find data source: com.mongodb.spark.sql.DefaultSource.
Failed to find data source: com.mongodb.spark.sql.DefaultSource.
This error indicates that PySpark failed to locate the MongoDB Spark Connector.
If you're invoking pyspark directly, make sure you specify mongo-spark-connector in the packages parameter. For example:
./bin/pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
If you're not invoking pyspark directly (e.g. from an IDE such as Eclipse), you have to modify the Spark configuration spark.jars.packages to specify the dependency.
Either within the spark-defaults.conf file:
spark.jars.packages org.mongodb.spark:mongo-spark-connector_2.11:2.2.0
Or, you can try changing the configuration within the code:
SparkConf().set("spark.jars.packages","org.mongodb.spark:mongo-spark-connector_2.11:2.2.0")
Or:
SparkSession.builder.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.2.0').getOrCreate()
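Putting it together, a minimal end-to-end sketch (reusing the URI from your example; adjust the database and collection to your setup):
from pyspark.sql import SparkSession

# Pull the connector from Maven and point the input URI at the target collection
spark = SparkSession.builder \
    .appName("App") \
    .config("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.2.0") \
    .config("spark.mongodb.input.uri", "mongodb://127.0.0.1/mydb.test") \
    .getOrCreate()

df = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()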
I had this problem. I could insert and find documents from the pyspark shell, but I could not do it using this code snippet within the PySpark runtime:
SparkSession.builder.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.2.0').getOrCreate()
You may also need to add the packages' jars manually to $SPARK_HOME/jars. Consider this answer.
My objective is to use Jupyter Notebook (IPython) with Apache Spark. I'm using Apache Toree to do this. I set the SPARK_HOME environment variable and configured the Apache Toree installation with Jupyter. Everything seems fine.
When I run the command below, a Jupyter browser window is opened: ipython notebook --profile=pyspark
When I choose Apache Toree - PySpark in the drop-down menu, I can't code in my notebook and I have this view (Python 2 is OK):
The red button gives :
What's wrong? Help, please?
Not really an answer, but if you're not hooked on Toree and just need a local Spark for learning and experimenting, you could download a copy of Spark, unzip it, and use this at the beginning of your notebook:
import os
import sys

# Point SPARK_HOME at the unpacked Spark distribution and make its Python
# bindings importable from this notebook's interpreter
os.environ['SPARK_HOME'] = "<path where you have extracted the spark file>"
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python'))
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'bin'))
sys.path.append(os.path.join(os.environ['SPARK_HOME'], 'python/lib/py4j-0.10.4-src.zip'))

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, Row
import pyspark.sql.functions as sql

sc = SparkContext()
sqlContext = SQLContext(sc)
print(sc.version)
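Alternatively, a shorter way to get the same sys.path setup, assuming the findspark package is installed (pip install findspark), is a sketch like:
import findspark
# Locates the Spark installation (from the argument or SPARK_HOME) and adds
# its Python bindings to sys.path
findspark.init("<path where you have extracted the spark file>")

from pyspark import SparkContext
sc = SparkContext()
print(sc.version)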