pyspark using mysql database on remote machine - python

I am using Python 2.7 on Ubuntu and running Spark from a Python script through a SparkContext.
My database is a remote MySQL instance that requires a username and password.
I try to query it with this code:
sc = createSparkContext()
sql = SQLContext(sc)
df = sql.read.format('jdbc').options(
    url='jdbc:mysql://ip:port?user=user&password=password',
    dbtable='(select * from tablename limit 100) as tablename'
).load()
print df.head()
And I get this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o32.load.
: java.sql.SQLException: No suitable driver
I found that I need the JDBC driver for MySQL, so I downloaded the platform-independent one from here.
I tried including it when starting the Spark context with
conf.set("spark.driver.extraClassPath", "/var/nfs/general/mysql-connector-java-5.1.43")
and also tried installing it with
sudo apt-get install libmysql-java
on the master machine, on the DB machine, and on the machine running the Python script, with no luck.
Edit 2:
I tried using
conf.set("spark.executor.extraClassPath", "/var/nfs/general/mysql-connector-java-5.1.43")
Judging by the output of
print sc.getConf().getAll()
which is
[(u'spark.driver.memory', u'3G'), (u'spark.executor.extraClassPath',
u'file:///var/nfs/general/mysql-connector-java-5.1.43.jar'),
(u'spark.app.name', u'spark-basic'), (u'spark.app.id',
u'app-20170830'), (u'spark.rdd.compress', u'True'),
(u'spark.master', u'spark://127.0.0.1:7077'), (u'spark.driver.port',
u''), (u'spark.serializer.objectStreamReset', u'100'),
(u'spark.executor.memory', u'2G'), (u'spark.executor.id', u'driver'),
(u'spark.submit.deployMode', u'client'), (u'spark.driver.host',
u''), (u'spark.driver.cores', u'3')]
the configuration includes the correct path, but I still get the same "No suitable driver" error.
What am I missing here?
Thanks

You need to set the classpath for both the driver and the worker nodes. Add the following to the Spark configuration:
conf.set("spark.executor.extraClassPath", "/var/nfs/general/mysql-connector-java-5.1.43")
conf.set("spark.driver.extraClassPath", "/var/nfs/general/mysql-connector-java-5.1.43")
Or you can pass it using
import os
os.environ['SPARK_CLASSPATH'] = "/path/to/driver/mysql.jar"
For Spark >= 2.0.0 you can add a comma-separated list of jars to the spark-defaults.conf file located in the $SPARK_HOME/conf directory, like this:
spark.jars path_2_jar1,path_2_jar2
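The same setting can also be applied programmatically before the context is created, which avoids editing spark-defaults.conf. A minimal sketch, assuming the connector jar sits at the NFS path from the question (note the explicit .jar extension):
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

conf = SparkConf()
conf.setAppName('spark-basic')
# spark.jars ships the listed jars to both the driver and the executors
conf.set("spark.jars", "/var/nfs/general/mysql-connector-java-5.1.43.jar")

sc = SparkContext(conf=conf)
sql = SQLContext(sc)
The JDBC read can then be done exactly as in the question.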

from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Word Count") \
    .config("spark.driver.extraClassPath", "/home/tuhin/mysql.jar") \
    .getOrCreate()
dataframe_mysql = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:mysql://ip:port/db_name") \
    .option("driver", "com.mysql.jdbc.Driver") \
    .option("dbtable", "employees") \
    .option("user", "root") \
    .option("password", "12345678") \
    .load()
print(dataframe_mysql.columns)
"/home/tuhin/mysql.jar" is the location of mysql jar file

Related

How to load sql file in spark using python

My PySpark version is 2.4 and my Python version is 2.7. I have a multi-line SQL file which needs to run in Spark. Instead of running it line by line, is it possible to keep the SQL file in Python (which initializes Spark) and execute it using spark-submit? I am trying to write a generic script in Python so that we only need to replace the SQL file in the HDFS folder later. Below is my code snippet.
import sys
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
args = str(sys.argv[1]).split(',')
fd = args[0]
ld = args[1]
sd = args[2]
#Below line does not work
df = open("test.sql")
query = df.read().format(fd,ld,sd)
#Initiating SparkSession.
spark = SparkSession.builder.appName("PC").enableHiveSupport().getOrCreate()
#Below line works fine
df_s=spark.sql("""select * from test_tbl where batch_date='2021-08-01'""")
#Execute the sql (Does not work now)
df_s1=spark.sql(query)
spark-submit throws the following error for the above code:
Exception in thread "main" org.apache.spark.SparkException:
Application application_1643050700073_7491 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1158)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1606)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:847)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:922)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:931)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
22/02/10 01:24:52 INFO util.ShutdownHookManager: Shutdown hook called
I am relatively new to PySpark. Can anyone please tell me what I am missing here?
You can't open a file that lives in HDFS with plain Python file I/O. If you want to run a SQL statement stored in a file on HDFS, you first have to copy that file from HDFS to your local directory.
Referring to the Spark 2.4.0 documentation, you can then simply use the PySpark API:
from os.path import expanduser, join, abspath
from pyspark.sql import SparkSession
from pyspark.sql import Row

# create the session first (same settings as in the question)
spark = SparkSession.builder.appName("PC").enableHiveSupport().getOrCreate()
spark.sql("YOUR QUERY").show()
or query files directly with:
df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
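If the goal is to keep the query in a .sql file and substitute the arguments at runtime, one option (a sketch, not taken from the answer above) is to read the file's text through Spark itself rather than open(); the HDFS path and the {0}/{1}/{2} placeholders below are assumptions for illustration:
import sys
from pyspark.sql import SparkSession

fd, ld, sd = str(sys.argv[1]).split(',')

spark = SparkSession.builder.appName("PC").enableHiveSupport().getOrCreate()

# wholeTextFiles returns (path, content) pairs; take the content of the single file
raw_sql = spark.sparkContext.wholeTextFiles("/user/me/test.sql").values().first()
query = raw_sql.format(fd, ld, sd)

df_s1 = spark.sql(query)
df_s1.show()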

`AnalysisException("Database 'delta' not found" ...` when creating delta table over DeltaTableBuilder API locally

I am creating delta tables in Databricks Runtime 9.1 LTS using the DeltaTableBuilder API in PySpark. This works fine. When running the same code locally (for a unit test) I get a strange error.
I'm setting up my local environment and the local PySpark session as described in the Quickstart guide.
Steps to reproduce:
Set up the environment as in DB Runtime 9.1: Python 3.8
pip install delta-spark==1.0.0
pip install pyspark==3.1.2
Create SparkSession:
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
session_builder = SparkSession.builder
session_builder = (
    session_builder.config("spark.jars.packages", "io.delta:delta-core_2.12:1.0.0")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(session_builder).getOrCreate()
Try to create delta table:
from delta import DeltaTable
DeltaTable.createIfNotExists(spark) \
    .location("test_tables/test.delta") \
    .addColumn("id", "LONG") \
    .execute()
Error:
pyspark.sql.utils.AnalysisException: Database 'delta' not found
In Databricks this error does not appear. It does not require any database "delta"; it just creates the delta table directory with the delta log in it, no database involved. But OK, let's create the database locally:
from delta import DeltaTable
spark.sql("CREATE DATABASE delta")
DeltaTable.createOrReplace(spark) \
    .location("test_delta") \
    .addColumn("id", "LONG") \
    .execute()
Error:
pyspark.sql.utils.AnalysisException: `delta`.`test_delta` is not a Delta table.
How can I make the local behaviour match Databricks?

Exception: Java gateway process exited before sending its port number with pyspark

I am working with Python and PySpark in a Jupyter notebook. I am trying to read several parquet files from an AWS S3 bucket and convert them into a single JSON file.
This is what I have:
import boto3
from functools import reduce
from pyspark.sql import DataFrame

# boto3 S3 resource (assumed to have been created earlier in the notebook)
s3 = boto3.resource('s3')
bucket = s3.Bucket(name='mybucket')
keys = []
for key in bucket.objects.all():
    keys.append(key.key)
print(keys[0])

from pyspark.sql import SparkSession
# initialise sparkContext
spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5gb') \
    .config("spark.cores.max", "6") \
    .getOrCreate()
sc = spark.sparkContext
But I am getting:
Exception: Java gateway process exited before sending its port number with pyspark
I am not sure how to fix this, thank you!
You're getting this error because PySpark is not able to communicate with your cluster. You need to set the values of some environment variables, like this:
import os
import findspark
findspark.init()

# pick the master that matches your setup: "yarn" or "local"
os.environ['PYSPARK_SUBMIT_ARGS'] = """--name job_name --master local
--conf spark.dynamicAllocation.enabled=true
pyspark-shell"""
os.environ['PYSPARK_PYTHON'] = "python3.6"  # whichever version of Python you are using
os.environ['python'] = "python3.6"
The findspark package is optional, but it is handy when working with PySpark.
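For completeness, a small sketch of how this fits together with the code from the question; the variables have to be set before the SparkSession is built, because that is when the Java gateway process is launched (master and app name simply mirror the question):
import os
import findspark
findspark.init()

# set the submit arguments before any Spark object is created
os.environ['PYSPARK_SUBMIT_ARGS'] = "--name myAppName --master local pyspark-shell"

from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5gb') \
    .config("spark.cores.max", "6") \
    .getOrCreate()
sc = spark.sparkContext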

AWS Glue - Truncate destination postgres table prior to insert

I am trying to truncate a Postgres destination table prior to the insert and, more generally, trying to fire external functions using the connections already created in Glue.
Has anyone been able to do so?
I've tried the DROP/TRUNCATE scenario. I was not able to do it with the connections already created in Glue, but I could with a pure Python PostgreSQL driver, pg8000:
1. Download the tar of pg8000 from PyPI.
2. Create an empty __init__.py in the root folder.
3. Zip up the contents and upload to S3.
4. Reference the zip file in the Python lib path of the job.
5. Set the DB connection details as job params (make sure to prepend all key names with --). Tick the "Server-side encryption" box.
Then you can simply create a connection and execute SQL.
import sys
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
import pg8000

# read the connection details passed in as job parameters
args = getResolvedOptions(sys.argv, [
    'JOB_NAME',
    'PW',
    'HOST',
    'USER',
    'DB'
])
# ...
# Create Spark & Glue context
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# ...
# open a plain pg8000 connection and truncate the target table
# (schema and table are defined elsewhere in the job)
config_port = 5432
conn = pg8000.connect(
    database=args['DB'],
    user=args['USER'],
    password=args['PW'],
    host=args['HOST'],
    port=config_port
)
query = "TRUNCATE TABLE {0};".format(".".join([schema, table]))
cur = conn.cursor()
cur.execute(query)
conn.commit()
cur.close()
conn.close()
After following step (4) of #thenaturalist's response,
sc.addPyFile("/home/glue/downloads/python/pg8000.zip")
import pg8000
worked for me in a development endpoint (Zeppelin notebook).
More info: https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html
To clarify #thenaturalist's instructions for the zip, as I still struggled with this:
Download the tar.gz of pg8000 from pypi.org and extract it.
Zip the contents so you have the structure below:
pg8000-1.15.3.zip
|
|-- pg8000 <dir>
       |-- __init__.py
       |-- _version.py <optional>
       |-- core.py
Upload it to S3 and then you should be able to do a simple import pg8000.
NOTE: scramp is also required at the moment, so follow the same procedure as above to include the scramp module. You don't need to import it, though.
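If you'd rather script the packaging step, here is a small sketch (not part of the original answer) using Python's standard library, assuming the tarball was extracted into a pg8000-1.15.3 directory next to the script:
import shutil

# build pg8000-1.15.3.zip containing only the inner pg8000 package directory,
# so that "import pg8000" works once the zip is on the job's Python path
shutil.make_archive(
    base_name="pg8000-1.15.3",   # output file: pg8000-1.15.3.zip
    format="zip",
    root_dir="pg8000-1.15.3",    # the extracted tarball
    base_dir="pg8000",           # archive just the package directory
)
The archive lands in the current directory, ready to upload to S3.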
Alternatively, you can let Spark's JDBC writer truncate the table, reusing the credentials from the Glue connection:
data = spark.sql(sql)
conf = glueContext.extract_jdbc_conf("jdbc-commerce")
data.write \
    .mode('overwrite') \
    .format("jdbc") \
    .option("url", conf['url']) \
    .option("database", 'Pacvue_Commerce') \
    .option("dbtable", "dbo.glue_1") \
    .option("user", conf['user']) \
    .option('truncate', 'true') \
    .option("password", conf['password']) \
    .save()
The Glue API does not support this, but the Spark API does.
jdbc-commerce is your connection name from the crawler.
Use extract_jdbc_conf to get the url, username and password.

how to load --jars with pyspark with spark standalone on client mode

I am using Python 2.7 with a Spark standalone cluster in client mode.
I want to use JDBC for MySQL and found that I need to load the driver using the --jars argument. I have the JDBC jar locally and manage to load it with the pyspark console like here.
When I write a Python script inside my IDE, using pyspark, I don't manage to load the additional jar mysql-connector-java-5.1.26.jar and keep getting a
no suitable driver
error.
How can I load additional jar files when running a Python script in client mode, using a standalone cluster and referring to a remote master?
Edit: added some code
This is the basic code that I am using. I use pyspark with a SparkContext in Python, i.e. I do not use spark-submit directly, and I don't understand how to use spark-submit parameters in this case...
def createSparkContext(masterAdress=algoMaster):
    """
    :return: return a spark context that is suitable for my configs
    note the ip for the master
    app name is not that important, just to show off
    """
    from pyspark.mllib.util import MLUtils
    from pyspark import SparkConf
    from pyspark import SparkContext
    import os
    SUBMIT_ARGS = "--driver-class-path /var/nfs/general/mysql-connector-java-5.1.43 pyspark-shell"
    #SUBMIT_ARGS = "--packages com.databricks:spark-csv_2.11:1.2.0 pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
    conf = SparkConf()
    #conf.set("spark.driver.extraClassPath", "var/nfs/general/mysql-connector-java-5.1.43")
    conf.setMaster(masterAdress)
    conf.setAppName('spark-basic')
    conf.set("spark.executor.memory", "2G")
    #conf.set("spark.executor.cores", "4")
    conf.set("spark.driver.memory", "3G")
    conf.set("spark.driver.cores", "3")
    #conf.set("spark.driver.extraClassPath", "/var/nfs/general/mysql-connector-java-5.1.43")
    sc = SparkContext(conf=conf)
    print sc._conf.get("spark.executor.extraClassPath")
    return sc

sql = SQLContext(sc)
df = sql.read.format('jdbc').options(
    url='jdbc:mysql://ip:port?user=user&password=pass',
    dbtable='(select * from tablename limit 100) as tablename'
).load()
print df.head()
Thanks
Your SUBMIT_ARGS are passed to spark-submit when a SparkContext is created from Python. You should use --jars instead of --driver-class-path.
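In other words, in createSparkContext the SUBMIT_ARGS line becomes something like this (the .jar extension is assumed here; the original path omits it):
import os

# --jars distributes the connector jar to the driver and all executors
SUBMIT_ARGS = "--jars /var/nfs/general/mysql-connector-java-5.1.43.jar pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS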
EDIT
Your problem is actually a lot simpler than it seems: you're missing the driver parameter in the options:
sql = SQLContext(sc)
df = sql.read.format('jdbc').options(
    url='jdbc:mysql://ip:port',
    user='user',
    password='pass',
    driver="com.mysql.jdbc.Driver",
    dbtable='(select * from tablename limit 100) as tablename'
).load()
You can also pass user and password as separate arguments instead of embedding them in the URL.
