Set PySpark Serializer in PySpark Builder - python

I am using PySpark 2.1.1 and am trying to set the serializer when using spark-submit. In my application, I initialize SparkSession.builder as follows:
print("creating spark session")
spark = SparkSession.builder.master("yarn").appName("AppName").\
    config("driver-library-path", "libPath").\
    config("driver-java-options", driverJavaOptions).\
    enableHiveSupport().\
    config("deploy-mode", "client").\
    config("spark.serializer", "PickleSerializer").\
    config("spark.executor.instances", 100).\
    config("spark.executor.memory", "4g").\
    getOrCreate()
I am getting the following error:
java.lang.ClassNotFoundException: PickleSerializer
What is the right way to initialize the serializer? I realize Pickle is the default, but I want to know how to do this in case I use one of the other supported serializers as well.

spark.serializer is used to set the Java (JVM-side) serializer. For the Python serializer, use the serializer argument of SparkContext:
from pyspark import SparkConf, SparkContext
from pyspark.serializers import PickleSerializer
conf = SparkConf().set(...)
sc = SparkContext(conf=conf, serializer=PickleSerializer())
Once the SparkContext is ready, you can use it to initialize a SparkSession explicitly:
spark = SparkSession(sc)
spark.sparkContext is sc
## True
or implicitly (it will use SparkContext.getOrCreate):
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sparkContext is sc
## True
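For example, if you want Kryo on the JVM side and one of the alternative Python serializers, the two settings go in different places. This is a minimal sketch (the app name is a placeholder); MarshalSerializer comes from pyspark.serializers and spark.serializer takes a fully qualified Java class name:
from pyspark import SparkConf, SparkContext
from pyspark.serializers import MarshalSerializer
from pyspark.sql import SparkSession
# JVM-side serializer: a fully qualified Java class name via spark.serializer
conf = SparkConf().setAppName("AppName").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
# Python-side serializer: passed directly to SparkContext
sc = SparkContext(conf=conf, serializer=MarshalSerializer())
spark = SparkSession(sc)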

Related

How to execute a stored procedure in Azure Databricks PySpark?

I am able to execute a simple SQL statement using PySpark in Azure Databricks but I want to execute a stored procedure instead. Below is the PySpark code I tried.
#initialize pyspark
import findspark
findspark.init(r'C:\Spark\spark-2.4.5-bin-hadoop2.7')
#import required modules
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import *
import pandas as pd
#Create spark configuration object
conf = SparkConf()
conf.setMaster("local").setAppName("My app")
#Create spark context and sparksession
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
table = "dbo.test"
#read table data into a spark dataframe
jdbcDF = spark.read.format("jdbc") \
    .option("url", f"jdbc:sqlserver://localhost:1433;databaseName=Demo;integratedSecurity=true;") \
    .option("dbtable", table) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
#show the data loaded into dataframe
#jdbcDF.show()
sqlQueries="execute testJoin"
resultDF=spark.sql(sqlQueries)
resultDF.show(resultDF.count(),False)
This doesn't work. How do I execute a stored procedure instead?
In case someone is still looking for a way to do this, it's possible to use the built-in JDBC connector of your Spark session. The following code sample will do the trick:
import msal
# Set url & credentials
jdbc_url = ...
tenant_id = ...
sp_client_id = ...
sp_client_secret = ...
# Write your SQL statement as a string
name = "Some passed value"
statement = f"""
EXEC Staging.SPR_InsertDummy
@Name = '{name}'
"""
# Generate an OAuth2 access token for service principal
authority = f"https://login.windows.net/{tenant_id}"
app = msal.ConfidentialClientApplication(sp_client_id, sp_client_secret, authority)
token = app.acquire_token_for_client(scopes=["https://database.windows.net/.default"])["access_token"]
# Create a spark properties object and pass the access token
properties = spark._sc._gateway.jvm.java.util.Properties()
properties.setProperty("accessToken", token)
# Fetch the driver manager from your spark context
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
# Create a connection object and pass the properties object
con = driver_manager.getConnection(jdbc_url, properties)
# Create callable statement and execute it
exec_statement = con.prepareCall(statement)
exec_statement.execute()
# Close connections
exec_statement.close()
con.close()
For more information, including a similar method that uses SQL-user credentials to connect over JDBC and how to handle return parameters, I'd suggest you take a look at this blog post:
https://medium.com/delaware-pro/executing-ddl-statements-stored-procedures-on-sql-server-using-pyspark-in-databricks-2b31d9276811
Running a stored procedure through a JDBC connection from Azure Databricks is not supported as of now. But your options are:
Use the pyodbc library to connect and execute your procedure. Keep in mind that with this library your code runs on the driver node while all your workers sit idle. See this article for details, and the pyodbc sketch below these options:
https://datathirst.net/blog/2018/10/12/executing-sql-server-stored-procedures-on-databricks-pyspark
Use a SQL table function rather than a procedure. In a sense, you can use anything that you can use in the FROM clause of a SQL query.
Since you are in an Azure environment, a combination of Azure Data Factory (to execute your procedure) and Azure Databricks can help you build pretty powerful pipelines.
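As a rough illustration of the pyodbc option, here is a minimal driver-side sketch; the server, database, credentials, and procedure name are placeholders, and it assumes the Microsoft ODBC driver is installed on the driver node:
import pyodbc
# Driver-side connection; this does not distribute any work to the executors
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;"
    "DATABASE=Demo;"
    "UID=my_user;PWD=my_password"
)
cursor = conn.cursor()
# Execute the stored procedure (placeholder name)
cursor.execute("EXEC dbo.testJoin")
conn.commit()
cursor.close()
conn.close()
Because everything goes through the driver's ODBC connection, this is fine for calling a procedure but does not parallelize any work across executors.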

Cannot import parse_url in pyspark

I have this SQL query, for HiveQL in PySpark:
spark.sql('SELECT split(parse_url(page.viewed_page, "PATH"), "/")[1] as path FROM df')
And I would like to translate it into a functional query like:
df.select(split(parse_url(col('page.viewed_page'), 'HOST')))
but when I import the parse_url function I get:
----> 1 from pyspark.sql.functions import split, parse_url
ImportError: cannot import name 'parse_url' from 'pyspark.sql.functions' (/usr/local/opt/apache-spark/libexec/python/pyspark/sql/functions.py)
Could you point me in the right direction to import the parse_url function?
Cheers
parse_url is a Hive UDF, so you need to enable Hive support when creating the SparkSession object:
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
Then your following query should work:
spark.sql('SELECT split(parse_url(page.viewed_page, "PATH"), "/")[1] as path FROM df')
If your Spark is < 2.2:
from pyspark import SparkContext
from pyspark.sql import HiveContext, SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)
hiveContext = HiveContext(sc)
query = 'SELECT split(parse_url(page.viewed_page, "PATH"), "/")[1] as path FROM df'
hiveContext.sql(query)  # this will work
sqlContext.sql(query)   # this will not work
EDIT:
parse_url is a Spark SQL built-in from Spark v2.3. It's not available in pyspark.sql.functions yet (as of 11/28/2020). You can still use it on a PySpark DataFrame by using selectExpr, like this:
df.selectExpr('parse_url(mycolumn, "HOST")')
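If you prefer to keep the whole original query in the DataFrame API, expr from pyspark.sql.functions can wrap the parse_url call; a minimal sketch reusing the column name from the question:
from pyspark.sql.functions import expr, split
# parse_url is evaluated as a SQL expression; split and indexing stay in the DataFrame API
result = df.select(split(expr('parse_url(page.viewed_page, "PATH")'), "/")[1].alias("path"))
result.show()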

Getting error with sparkSession while using multiprocessing in PySpark

My code is as follows:
def processFiles(prcFile, spark: SparkSession):
    print(prcFile)
    app_id = spark.sparkContext.getConf().get('spark.app.id')
    app_name = spark.sparkContext.getConf().get('spark.app.name')
    print(app_id)
    print(app_name)

def main(configPath, args):
    config.read(configPath)
    spark: SparkSession = pyspark.sql.SparkSession.builder.appName("multiprocessing").enableHiveSupport().getOrCreate()
    mprc = multiprocessing.Pool(3)
    lst = glob.glob(config.get('DIT_setup_config', 'prcDetails') + 'prc_PrcId_[0-9].json')
    mprc.map(processFiles, zip(lst, repeat(spark.newSession())))
Now I want to pass a new Spark session (spark.newSession()) and process the data accordingly, but I am getting an error that says:
Exception: It appears that you are attempting to reference
SparkContext from a broadcast variable, action, or transformation.
SparkContext can only be used on the driver, not in code that it run
on workers. For more information, see SPARK-5063.
Any help will be highly appreciated.
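The error comes from trying to ship the SparkSession (and its SparkContext) to separate worker processes; as the message says, the SparkContext can only be used on the driver. One common workaround, sketched here under the assumption that each file can be handled independently, is to use driver-side threads so every task reuses the same session instead of pickling it:
from concurrent.futures import ThreadPoolExecutor
from pyspark.sql import SparkSession

def processFiles(prcFile, spark):
    # Runs in a driver-side thread, so the shared SparkContext is still valid
    print(prcFile, spark.sparkContext.getConf().get('spark.app.id'))

spark = SparkSession.builder.appName("multiprocessing").enableHiveSupport().getOrCreate()
files = ["prc_PrcId_1.json", "prc_PrcId_2.json"]  # placeholder list

with ThreadPoolExecutor(max_workers=3) as pool:
    # newSession() gives each thread isolated SQL configuration while sharing the context
    list(pool.map(lambda f: processFiles(f, spark.newSession()), files))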

Getting org.bson.BsonInvalidOperationException: Invalid state INITIAL while printing pyspark dataframe

I am able to connect to MongoDB from my Spark job, but when I try to view the data being loaded from the database, I get the error mentioned in the title. I am using the pyspark module of Apache Spark.
The code snippet is:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
import sys
print(sys.stdin.encoding, sys.stdout.encoding)
conf = SparkConf()
conf.set('spark.mongodb.input.uri', 'mongodb://127.0.0.1/github.users')
conf.set('spark.mongodb.output.uri', 'mongodb://127.0.0.1/github.users')
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
df = df.sort('followers', ascending=True)
df.take(1)

How to build a sparkSession in Spark 2.0 using pyspark?

I just got access to Spark 2.0; I have been using Spark 1.6.1 up until this point. Can someone please help me set up a SparkSession using PySpark (Python)? I know the Scala examples available online are similar (here), but I was hoping for a direct walkthrough in Python.
My specific case: I am loading in avro files from S3 in a zeppelin spark notebook. Then building df's and running various pyspark & sql queries off of them. All of my old queries use sqlContext. I know this is poor practice, but I started my notebook with
sqlContext = SparkSession.builder.enableHiveSupport().getOrCreate().
I can read in the avros with
mydata = sqlContext.read.format("com.databricks.spark.avro").load("s3:...
and build dataframes with no issues. But once I start querying the dataframes/temp tables, I keep getting the "java.lang.NullPointerException" error. I think that is indicative of a translational error (e.g. old queries worked in 1.6.1 but need to be tweaked for 2.0). The error occurs regardless of query type. So I am assuming
1.) the sqlContext alias is a bad idea
and
2.) I need to properly set up a sparkSession.
So if someone could show me how this is done, or perhaps explain the discrepancies they know of between the different versions of spark, I would greatly appreciate it. Please let me know if I need to elaborate on this question. I apologize if it is convoluted.
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('abc').getOrCreate()
Now, to read a .csv file you can use:
df=spark.read.csv('filename.csv',header=True)
As you can see in the Scala example, SparkSession is part of the SQL module. It is similar in Python; see the pyspark.sql module documentation:
class pyspark.sql.SparkSession(sparkContext, jsparkSession=None)
The entry point to programming Spark with the Dataset and DataFrame API. A SparkSession can be used to create DataFrame, register DataFrame as tables, execute SQL over tables, cache tables, and read parquet files.
To create a SparkSession, use the following builder pattern:
>>> spark = SparkSession.builder \
... .master("local") \
... .appName("Word Count") \
... .config("spark.some.config.option", "some-value") \
... .getOrCreate()
From here http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html
You can create a spark session using this:
>>> from pyspark.sql import SparkSession
>>> from pyspark.conf import SparkConf
>>> c = SparkConf()
>>> SparkSession.builder.config(conf=c)
spark = SparkSession.builder\
.master("local")\
.enableHiveSupport()\
.getOrCreate()
spark.conf.set("spark.executor.memory", '8g')
spark.conf.set('spark.executor.cores', '3')
spark.conf.set('spark.cores.max', '3')
spark.conf.set("spark.driver.memory",'8g')
sc = spark.sparkContext
Here's a useful Python SparkSession class I developed:
#!/bin/python
# -*- coding: utf-8 -*-

######################
# SparkSession class #
######################

class SparkSession:
    # - Notes:
    #   The main object is the Spark Context ('sc' object).
    #   All new Spark sessions ('spark' objects) share the same underlying Spark Context ('sc' object) in the same JVM,
    #   but for each Spark session the temporary tables and registered functions are isolated.
    #   You can't create a new Spark Context in another JVM by using 'sc = SparkContext(conf)',
    #   but it's possible to create several Spark Contexts in the same JVM by setting 'spark.driver.allowMultipleContexts' to true (not recommended).
    # - See:
    #   https://medium.com/@achilleus/spark-session-10d0d66d1d24
    #   https://stackoverflow.com/questions/47723761/how-many-sparksessions-can-a-single-application-have
    #   https://stackoverflow.com/questions/34879414/multiple-sparkcontext-detected-in-the-same-jvm
    #   https://stackoverflow.com/questions/39780792/how-to-build-a-sparksession-in-spark-2-0-using-pyspark
    #   https://stackoverflow.com/questions/47813646/sparkcontext-getorcreate-purpose?noredirect=1&lq=1

    from pyspark.sql import SparkSession

    spark = None   # The Spark Session
    sc = None      # The Spark Context
    scConf = None  # The Spark Context conf

    def _init(self):
        self.sc = self.spark.sparkContext
        self.scConf = self.sc.getConf()  # or self.scConf = self.spark.sparkContext._conf

    # Return the current Spark Session (singleton), otherwise create a new one
    def getOrCreateSparkSession(self, master=None, appName=None, config=None, enableHiveSupport=False):
        cmd = "self.SparkSession.builder"
        if master is not None: cmd += ".master(" + repr(master) + ")"    # repr() quotes the value for eval
        if appName is not None: cmd += ".appName(" + repr(appName) + ")" # repr() quotes the value for eval
        if config is not None: cmd += ".config(" + config + ")"          # config is inserted verbatim, e.g. '"spark.executor.memory", "4g"'
        if enableHiveSupport: cmd += ".enableHiveSupport()"
        cmd += ".getOrCreate()"
        self.spark = eval(cmd)
        self._init()
        return self.spark

    # Return the current Spark Context (singleton), otherwise create a new one via getOrCreateSparkSession()
    def getOrCreateSparkContext(self, master=None, appName=None, config=None, enableHiveSupport=False):
        self.getOrCreateSparkSession(master, appName, config, enableHiveSupport)
        return self.sc

    # Create a new Spark session from the current Spark session (with isolated SQL configurations).
    # The new Spark session shares the underlying SparkContext and cached data,
    # but the temporary tables and registered functions are isolated.
    def createNewSparkSession(self, currentSparkSession):
        self.spark = currentSparkSession.newSession()
        self._init()
        return self.spark

    def getSparkSession(self):
        return self.spark

    def getSparkSessionConf(self):
        return self.spark.conf

    def getSparkContext(self):
        return self.sc

    def getSparkContextConf(self):
        return self.scConf

    def getSparkContextConfAll(self):
        return self.scConf.getAll()

    def setSparkContextConfAll(self, properties):
        # Properties example: {'spark.executor.memory': '4g', 'spark.app.name': 'Spark Updated Conf', 'spark.executor.cores': '4', 'spark.cores.max': '4'}
        self.scConf = self.scConf.setAll(list(properties.items()))  # or self.scConf = self.spark.sparkContext._conf.setAll(...)

    # Stop (clear) the active SparkSession for the current thread.
    #def stopSparkSession(self):
    #    return self.spark.clearActiveSession()

    # Stop the underlying SparkContext.
    def stopSparkContext(self):
        self.spark.stop()  # or self.sc.stop()

    # Returns the active SparkSession for the current thread, returned by the builder.
    #def getActiveSparkSession(self):
    #    return self.spark.getActiveSession()

    # Returns the default SparkSession that is returned by the builder.
    #def getDefaultSession(self):
    #    return self.spark.getDefaultSession()
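A quick usage sketch of the class above (the master and app name are placeholders):
# Build or reuse the singleton Spark session through the helper class
helper = SparkSession()
spark = helper.getOrCreateSparkSession(master="local[*]", appName="MyApp")
sc = helper.getSparkContext()
# Create an isolated session that shares the same underlying SparkContext
spark2 = helper.createNewSparkSession(spark)
print(spark2.sparkContext is sc)  # True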
