Running Spark with a SQL file from Python gives an error

I am trying to run a .sql file containing Hive queries from a Python (.py) file using Spark. It fails with: AttributeError: 'Builder' object has no attribute 'SparkContext'
I looked at multiple posts about similar errors and tried their suggestions, but none of them worked for me. Here is my code:
from pyspark import SparkContext, SparkConf, SQLContext
sc = SparkSession.SparkContext.getOrCreate()
with open("/apps/home/p1.sql") as fr:
    query = fr.read()
results = sc.sql(query)
p1.sql contains SQL queries. How do I pass parameters to the SQL file? What is different when the SQL returns rows versus when it does not? I am new to Spark and would appreciate an answer that includes code. Thanks.

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
with open("/apps/home/p1.sql") as fr:
    query = fr.read()
results = spark.sql(query)
For details, see https://spark.apache.org/docs/latest/sql-getting-started.html#starting-point-sparksession
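The questions about parameters and about statements that return no rows can be sketched as follows. This is only one possible approach and it rests on assumptions that are not in the original post: p1.sql marks parameters with $name placeholders (here a hypothetical $run_date), statements are separated by semicolons, and no semicolon appears inside a string literal. render_sql, split_statements and run_sql_file are illustrative helper names, not Spark APIs.

```python
from string import Template


def render_sql(sql_text, params):
    """Fill $name placeholders in the SQL text (hypothetical convention)."""
    # substitute() raises KeyError when a placeholder has no value,
    # so typos surface before anything reaches Spark.
    return Template(sql_text).substitute(params)


def split_statements(sql_text):
    """Naively split on ';' (assumes no semicolons inside string literals)."""
    return [s.strip() for s in sql_text.split(";") if s.strip()]


def run_sql_file(spark, path, params):
    """Run every statement in the file and return the last DataFrame."""
    with open(path) as fr:
        sql_text = render_sql(fr.read(), params)
    result = None
    for statement in split_statements(sql_text):
        # spark.sql() takes one statement per call.
        result = spark.sql(statement)
    return result


# Usage, given an existing SparkSession named spark:
# results = run_sql_file(spark, "/apps/home/p1.sql", {"run_date": "2020-01-01"})
# results.show()
```

spark.sql() always returns a DataFrame. For a SELECT, nothing runs until you call an action such as show() or collect(). For commands such as CREATE TABLE or INSERT, the statement runs immediately and the returned DataFrame is simply empty. Because the values are pasted into the SQL text, only pass values you trust, and write a literal dollar sign in the file as $$.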

Related

How to execute a stored procedure in Azure Databricks PySpark?

I am able to execute a simple SQL statement using PySpark in Azure Databricks but I want to execute a stored procedure instead. Below is the PySpark code I tried.
#initialize pyspark
import findspark
findspark.init(r'C:\Spark\spark-2.4.5-bin-hadoop2.7')
#import required modules
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import *
import pandas as pd
#Create spark configuration object
conf = SparkConf()
conf.setMaster("local").setAppName("My app")
#Create spark context and sparksession
sc = SparkContext.getOrCreate(conf=conf)
spark = SparkSession(sc)
table = "dbo.test"
#read table data into a spark dataframe
jdbcDF = spark.read.format("jdbc") \
    .option("url", "jdbc:sqlserver://localhost:1433;databaseName=Demo;integratedSecurity=true;") \
    .option("dbtable", table) \
    .option("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver") \
    .load()
#show the data loaded into dataframe
#jdbcDF.show()
sqlQueries="execute testJoin"
resultDF=spark.sql(sqlQueries)
resultDF.show(resultDF.count(),False)
This doesn't work. How do I do it?
In case someone is still looking for a way to do this: it is possible to use the built-in JDBC connector of your Spark session. The following code sample will do the trick:
import msal
# Set url & credentials
jdbc_url = ...
tenant_id = ...
sp_client_id = ...
sp_client_secret = ...
# Write your SQL statement as a string
name = "Some passed value"
statement = f"""
EXEC Staging.SPR_InsertDummy
@Name = '{name}'
"""
# Generate an OAuth2 access token for service principal
authority = f"https://login.windows.net/{tenant_id}"
app = msal.ConfidentialClientApplication(sp_client_id, sp_client_secret, authority)
token = app.acquire_token_for_client(scopes=["https://database.windows.net/.default"])["access_token"]
# Create a spark properties object and pass the access token
properties = spark._sc._gateway.jvm.java.util.Properties()
properties.setProperty("accessToken", token)
# Fetch the driver manager from your spark context
driver_manager = spark._sc._gateway.jvm.java.sql.DriverManager
# Create a connection object and pass the properties object
con = driver_manager.getConnection(jdbc_url, properties)
# Create callable statement and execute it
exec_statement = con.prepareCall(statement)
exec_statement.execute()
# Close connections
exec_statement.close()
con.close()
For more information, including a similar method that connects over JDBC with SQL user credentials and notes on how to read return parameters, I suggest taking a look at this blog post:
https://medium.com/delaware-pro/executing-ddl-statements-stored-procedures-on-sql-server-using-pyspark-in-databricks-2b31d9276811
Running a stored procedure through a JDBC connection from Azure Databricks is not supported as of now. Your options are:
1. Use the pyodbc library to connect and execute your procedure. Note that with this library your code runs on the driver node while all your workers sit idle. See this article for details:
https://datathirst.net/blog/2018/10/12/executing-sql-server-stored-procedures-on-databricks-pyspark
2. Use a SQL table function rather than a procedure. In a sense, you can use anything that is valid in the FROM clause of a SQL query.
3. Since you are in an Azure environment, combining Azure Data Factory (to execute your procedure) with Azure Databricks can help you build pretty powerful pipelines.
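The pyodbc option might look roughly like this. It is a sketch under assumptions: pyodbc and a SQL Server ODBC driver are installed on the cluster, connection_string is yours to supply, and call_procedure is a hypothetical helper name. The procedure name is borrowed from the earlier answer. Parameter values are bound with ? markers instead of being formatted into the statement text.

```python
def call_procedure(conn, proc_name, params=()):
    """Execute a stored procedure over a DB-API connection such as pyodbc's.

    proc_name is inserted into the statement as-is, so it must come from
    trusted code. Values in params are bound by the driver.
    """
    params = tuple(params)
    if params:
        placeholders = ", ".join("?" for _ in params)
        # ODBC call escape syntax with one marker per parameter.
        call = "{CALL %s (%s)}" % (proc_name, placeholders)
    else:
        call = "{CALL %s}" % proc_name

    cursor = conn.cursor()
    try:
        if params:
            cursor.execute(call, params)
        else:
            cursor.execute(call)
        conn.commit()
    finally:
        cursor.close()
    return call


# Usage (runs on the driver node only):
# import pyodbc
# conn = pyodbc.connect(connection_string)
# call_procedure(conn, "Staging.SPR_InsertDummy", ["Some passed value"])
# conn.close()
```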

Cannot import parse_url in pyspark

I have this sql query, for hiveql in pyspark:
spark.sql('SELECT split(parse_url(page.viewed_page, "PATH"), "/")[1] as path FROM df')
And I would like to translate into functional query like:
df.select(split(parse_url(col('page.viewed_page'), 'HOST')))
but when I import the parse_url function I get:
----> 1 from pyspark.sql.functions import split, parse_url
ImportError: cannot import name 'parse_url' from 'pyspark.sql.functions' (/usr/local/opt/apache-spark/libexec/python/pyspark/sql/functions.py)
Could you point me in the right direction to import the parse_url function.
Cheers
parse_url is a Hive UDF, so you need to enable Hive support when creating the SparkSession object:
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
Then the following query should work:
spark.sql('SELECT split(parse_url(page.viewed_page, "PATH"), "/")[1] as path FROM df')
If your Spark is <2.2:
from pyspark import SparkContext
from pyspark.sql import HiveContext, SQLContext
sc = SparkContext()
sqlContext = SQLContext(sc)
hivContext = HiveContext(sc)
query = 'SELECT split(parse_url(page.viewed_page, "PATH"), "/")[1] as path FROM df'
hivContext.sql(query) # this will work
sqlContext.sql(query) # this will not work
EDIT:
parse_url is a SparkSQL builtin from Spark v2.3. It's not available in pyspark.sql.functions as of yet (11/28/2020). You can still use it on a pyspark dataframe by using selectExpr like this:
df.selectExpr('parse_url(mycolumn, "HOST")')
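To stay close to the DataFrame style asked for in the question, the selectExpr approach can be wrapped in a small helper. This is only a sketch: path_segment_expr and select_path_segment are made-up names, and the column page.viewed_page comes from the question.

```python
def path_segment_expr(column, part="PATH", index=1, alias="path"):
    """Build a Spark SQL expression string for one '/'-separated piece
    of the requested URL part."""
    return "split(parse_url({col}, '{part}'), '/')[{idx}] AS {alias}".format(
        col=column, part=part, idx=index, alias=alias
    )


def select_path_segment(df, column, **kwargs):
    """Apply the expression to a DataFrame through selectExpr."""
    return df.selectExpr(path_segment_expr(column, **kwargs))


# Usage with an existing DataFrame df:
# paths = select_path_segment(df, "page.viewed_page")
#
# To combine it with other Column functions, pyspark.sql.functions.expr
# turns the same SQL fragment into a Column:
# from pyspark.sql import functions as F
# df.select(F.split(F.expr("parse_url(page.viewed_page, 'PATH')"), "/")[1].alias("path"))
```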

Getting error with SparkSession while using multiprocessing in PySpark

My code is as follows:
def processFiles(prcFile , spark:SparkSession):
    print(prcFile)
    app_id = spark.sparkContext.getConf().get('spark.app.id')
    app_name = spark.sparkContext.getConf().get('spark.app.name')
    print(app_id)
    print(app_name)
def main(configPath,args):
    config.read(configPath)
    spark: SparkSession = pyspark.sql.SparkSession.builder.appName("multiprocessing").enableHiveSupport().getOrCreate()
    mprc = multiprocessing.Pool(3)
    lst=glob.glob(config.get('DIT_setup_config', 'prcDetails')+'prc_PrcId_[0-9].json')
    mprc.map(processFiles,zip(lst, repeat(spark.newSession())))
Now I want to pass a new Spark session (spark.newSession()) to each worker and process the data accordingly, but I get this error:
Exception: It appears that you are attempting to reference
SparkContext from a broadcast variable, action, or transformation.
SparkContext can only be used on the driver, not in code that it run
on workers. For more information, see SPARK-5063.
Any help would be highly appreciated.
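One common way around this error, assuming the per-file work only needs to submit Spark jobs from the driver, is to use threads instead of processes. multiprocessing.Pool pickles its arguments to send them to child processes, and a SparkSession cannot be pickled, which triggers the exception above. Threads share the driver's session, and Spark accepts jobs submitted from several threads. The names process_file and process_all below are illustrative, not part of the original code.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def process_file(spark, prc_file):
    """Per-file work. Runs in a driver-side thread, so spark is usable."""
    app_id = spark.sparkContext.getConf().get("spark.app.id")
    return prc_file, app_id


def process_all(spark, files, max_workers=3):
    """Fan the files out over threads that share one SparkSession."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Threads live in the driver process, so nothing is pickled.
        # map() returns results in the same order as the input files.
        return list(pool.map(partial(process_file, spark), files))


# Usage with an existing SparkSession named spark and the list lst of JSON paths:
# results = process_all(spark, lst)
```

Binding the session with partial() also sidesteps a second issue in the original call: Pool.map hands each zipped (file, session) tuple to the function as a single argument.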

Getting org.bson.BsonInvalidOperationException: Invalid state INITIAL while printing pyspark dataframe

I am able to connect to MongoDB from my Spark job, but when I try to view the data loaded from the database I get the error mentioned in the title. I am using the pyspark module of Apache Spark.
The code snippet is:
from pyspark import SparkConf,SparkContext
from pyspark.sql import SQLContext
import sys
print(sys.stdin.encoding, sys.stdout.encoding)
conf=SparkConf()
conf.set('spark.mongodb.input.uri','mongodb://127.0.0.1/github.users')
conf.set('spark.mongodb.output.uri','mongodb://127.0.0.1/github.users')
sc =SparkContext(conf=conf)
sqlContext =SQLContext(sc)
df = sqlContext.read.format("com.mongodb.spark.sql.DefaultSource").load()
df.printSchema()
df = df.sort('followers', ascending = True)
df.take(1)

Unable to gather data using PySpark connection with Hive

I am currently trying to run queries via PySpark. Connecting to and accessing the database went well. Unfortunately, when I run a query, the only output displayed is the column names followed by None.
I read through the documentation but could not find any answers. Below is how I accessed the database.
import sys
try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    from pyspark.sql import SparkSession
    from pyspark.sql import SQLContext
    print ("Successfully imported Spark Modules")
except ImportError as e:
    print ("Can not import Spark Modules", e)
    sys.exit(1)
sc = SparkContext('local', 'pyspark')
sqlctx = SQLContext(sc)
df = sqlctx.read.format("jdbc").option("url", "jdbc:hive2://.....").option("dbtable", "(SELECT * FROM dtable LIMIT 10) df").load()
print df.show()
The output of df.show() is just the column names. When I run the same query using PyHive, data is returned, so I assume it has something to do with the way I am loading the table using PySpark.
Thanks!
