How to load a SQL file in Spark using Python

My PySpark version is 2.4 and my Python version is 2.7. I have a multi-line SQL file that needs to run in Spark. Instead of running it line by line, is it possible to have the Python script (which initializes Spark) read the SQL file and execute it using spark-submit? I am trying to write a generic Python script so that later we only need to replace the SQL file in an HDFS folder. Below is my code snippet.
import sys
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
args = str(sys.argv[1]).split(',')
fd = args[0]
ld = args[1]
sd = args[2]
#Below line does not work
df = open("test.sql")
query = df.read().format(fd,ld,sd)
#Initiating SparkSession.
spark = SparkSession.builder.appName("PC").enableHiveSupport().getOrCreate()
#Below line works fine
df_s=spark.sql("""select * from test_tbl where batch_date='2021-08-01'""")
#Execute the sql (Does not work now)
df_s1=spark.sql(query)
spark-submit throws following error for the above code.
Exception in thread "main" org.apache.spark.SparkException:
Application application_1643050700073_7491 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1158)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1606)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:847)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:922)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:931)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
22/02/10 01:24:52 INFO util.ShutdownHookManager: Shutdown hook called
I am relatively new to PySpark. Can anyone please tell me what I am missing here?

Python's open() reads from the local filesystem of the machine running the driver; it cannot read a path in HDFS. If the SQL file lives in HDFS, you first have to copy it from HDFS to a local directory.
As described in the Spark 2.4.0 documentation, you can then simply use the PySpark API:
from pyspark.sql import SparkSession
# Create (or reuse) a Hive-enabled session before calling spark.sql().
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
spark.sql("YOUR QUERY").show()
or query files directly with:
df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
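A minimal sketch of the generic script the question describes. It assumes the SQL file uses `{0}`, `{1}`, `{2}` placeholders for `str.format`, contains a single statement, and is readable from the driver's working directory (for example, shipped with `spark-submit --files test.sql` when running on YARN in cluster mode). The helper names `load_query` and `run_query_file` are hypothetical, not part of PySpark.

```python
import io


def load_query(path, *params):
    """Read a SQL file from the local filesystem and fill str.format placeholders."""
    # io.open behaves the same on Python 2.7 and Python 3.
    with io.open(path, encoding="utf-8") as f:
        text = f.read()
    # Drop a trailing semicolon, which spark.sql() may reject.
    text = text.strip().rstrip(";").strip()
    return text.format(*params)


def run_query_file(spark, path, *params):
    """Execute the rendered query on an existing SparkSession."""
    return spark.sql(load_query(path, *params))
```

With the question's variables this would be called as `run_query_file(spark, "test.sql", fd, ld, sd)`. Because `str.format` treats every brace as a placeholder, literal braces in the SQL text would need to be doubled.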

Related

Getting objects from S3 bucket using PySpark

I'm trying to get JSON objects from an S3 bucket using PySpark (on Windows, using wsl2 terminal).
I can do this using boto3 as an intermediate step but, when I try to use the spark.read.json method, I get an error.
Code:
import findspark
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
import os
import multiprocessing
#----------------APACHE CONFIGURATIONS--------------
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
#---------------spark--------------
conf = (
SparkConf()
.set('spark.executor.extraJavaOptions','-Dcom.amazonaws.services.s3.enableV4=true')
.set('spark.driver.extraJavaOptions','-Dcom.amazonaws.services.s3.enableV4=true')
.setAppName('pyspark_aws')
.setMaster(f"local[{multiprocessing.cpu_count()}]")
.setIfMissing("spark.executor.memory", "2g")
)
sc=SparkContext(conf=conf)
sc.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')
spark=SparkSession(sc)
#--------------hadoop--------------
accessKeyId='xxxxxxxxxxxx'
secretAccessKey='xxxxxxxxx'
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', accessKeyId)
hadoopConf.set('fs.s3a.secret.key', secretAccessKey)
hadoopConf.set('fs.s3a.endpoint', 's3-eu-west-1.amazonaws.com')
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
hadoopConf.set('fs.s3a.multipart.size', '419430400')
hadoopConf.set('fs.s3a.multipart.threshold', '2097152000')
hadoopConf.set('fs.s3a.connection.maximum', '500')
hadoopConf.set('s3a.connection.timeout', '600000')
s3_df = spark.read.json('s3a://{bucket}/{directory}/{object}.json')
Error:
py4j.protocol.Py4JJavaError: An error occurred while call
: java.lang.NumberFormatException: For input string: "32M
at java.base/java.lang.NumberFormatException.forI
at java.base/java.lang.Long.parseLong(Long.java:6
at java.base/java.lang.Long.parseLong(Long.java:8
at org.apache.hadoop.conf.Configuration.getLong(C
at org.apache.hadoop.fs.s3a.S3AFileSystem.getDefa
at org.apache.hadoop.fs.FileSystem.getDefaultBloc
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFile
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFile
at org.apache.hadoop.fs.FileSystem.exists(FileSys
at org.apache.spark.sql.execution.datasources.Dat
at org.apache.spark.sql.execution.datasources.Dat
at org.apache.spark.util.ThreadUtils$.$anonfun$pa
at java.base/java.util.concurrent.ForkJoinTask$Ru
at java.base/java.util.concurrent.ForkJoinTask.do
at java.base/java.util.concurrent.ForkJoinPool$Wo
at java.base/java.util.concurrent.ForkJoinPool.sc
at java.base/java.util.concurrent.ForkJoinPool.ru
at java.base/java.util.concurrent.ForkJoinWorkerT
I added the multipart.size, multipart.threshold, connection.maximum and connection.timeout Hadoop conf settings when I was getting a similar error earlier (that error had '64M' instead of '32M', and it changed when I added these settings).
I'm new to Spark so any and all tips/pointers would be helpful!
The "32M" is the default value of "fs.s3a.block.size".
Try setting it as a plain number of bytes: hadoopConf.set('fs.s3a.block.size', '33554432')
See https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
There you will find the explanations of the "32M" and the "64M" values.
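The value `33554432` is simply 32 MiB written out in bytes; the stack trace shows the setting being parsed with `Long.parseLong`, which cannot handle a suffix such as `M`. A small sketch of the conversion, assuming binary (1024-based) suffixes. The helper `size_to_bytes` is hypothetical and not part of Hadoop or PySpark.

```python
# Multipliers for the size suffixes handled by this sketch.
_UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def size_to_bytes(value):
    """Convert a size such as '32M' to an integer byte count."""
    value = value.strip().upper()
    if value and value[-1] in _UNITS:
        return int(value[:-1]) * _UNITS[value[-1]]
    return int(value)


print(size_to_bytes("32M"))  # 33554432
print(size_to_bytes("64M"))  # 67108864
```

The result can be passed on as a string, for example `hadoopConf.set('fs.s3a.block.size', str(size_to_bytes('32M')))`.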

Is there a way to read a csv file in hdfs into a python dataframe using insecureclient in a jupyter notebook?

I have a CSV file located on HDFS on a remote server.
I want to read the CSV file into a pandas DataFrame using InsecureClient, but I keep getting an error.
1st attempt:
code:
from hdfs import InsecureClient
client_hdfs = InsecureClient('hdfs://host:port', user=user)
with client_hdfs.read('path/to/csv.csv') as reader:
    print(reader)
error:
InvalidSchema: No connection adapters were found for
'host:port:path/to/csv.csv'
I have verified that the path is correct by running 'hdfs -ls path/to/csv.csv' on the server and viewing the file there and have obtained the host name by running 'uname -n' on the server.
2nd attempt (created a new test file with the contents "this is a test file" and placed in same hdfs location and tried to read):
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName("myFirstApp").setMaster("local")
sc = SparkContext(conf=conf)
path = "hdfs://host:port/path/to/textfile.txt"
df = sc.textFile(path)
df.first()
Error:
runs indefinitely without ever returning result
The module you're using connects to WebHDFS (over HTTP), not to the HDFS NameNode port (hdfs:// protocol), so the client URL must start with http:// rather than hdfs://.
Docs - https://hdfscli.readthedocs.io/en/latest/quickstart.html#instantiating-a-client
PySpark, on the other hand, does allow connecting over hdfs://. You should also use Spark DataFrames rather than pandas directly - https://databricks.com/blog/2021/10/04/pandas-api-on-upcoming-apache-spark-3-2.html
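A rough sketch of the WebHDFS route, assuming the `hdfs` (hdfscli) and `pandas` packages are installed and WebHDFS is enabled on the NameNode. The default port `9870` is an assumption (it is the usual NameNode HTTP port on Hadoop 3; Hadoop 2 typically uses `50070`), and the helper names `webhdfs_url` and `read_hdfs_csv` are hypothetical.

```python
def webhdfs_url(host, port=9870):
    """Build the HTTP URL that hdfs.InsecureClient expects (not hdfs://)."""
    return "http://{0}:{1}".format(host, port)


def read_hdfs_csv(host, path, user, port=9870):
    """Read a CSV from HDFS over WebHDFS into a pandas DataFrame."""
    # Imported lazily so the URL helper works without these packages installed.
    import pandas as pd
    from hdfs import InsecureClient

    client = InsecureClient(webhdfs_url(host, port), user=user)
    # client.read() is a context manager yielding a file-like reader.
    with client.read(path, encoding="utf-8") as reader:
        return pd.read_csv(reader)
```

A call would look like `read_hdfs_csv("host", "path/to/csv.csv", user)`, with the host and path taken from the question.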

How to execute hql script with transform python udf in spark?

I am new to Spark and learning through a POC. As part of this POC I am trying to directly execute an HQL file that uses the transform keyword to call a Python UDF.
I have tested the HQL script in the CLI with "hive -f filename.hql" and it works fine.
I tried the same script in spark-sql, but it fails with an "HDFS path not found" error. I tried giving the HDFS path in the different ways shown below, but none of them work:
"/test/scripts/test.hql"
"hdfs://test.net:8020/test/scripts/test.hql"
"hdfs:///test.net:8020/test/scripts/test.hql"
Also tried giving complete path in hive transform code as below
USING "scl enable python27 'python hdfs://test.net:8020/user/test/scripts/TestPython.py'"
Hive Code
add file hdfs://test.net:8020/user/test/scripts/TestPython.py;
select * from
(select transform (*)
USING "scl enable python27 'python TestPython.py'"
as (Col_1 STRING,
col_2 STRING,
...
..
col_125 STRING
)
FROM
test.transform_inner_temp1 a) b;
TestPython code:
#!/usr/bin/env python
'''
Created on June 2, 2017
#author: test
'''
import sys
from datetime import datetime
import decimal
import string
D = decimal.Decimal
for line in sys.stdin:
    line = sys.stdin.readline()
    TempList = line.strip().split('\t')
    col_1 = TempList[0]
    ...
    ....
    col_125 = TempList[34] + TempList[32]
    outList.extend((col_1,....col_125))
    outValue = "\t".join(map(str,outList))
    print "%s"%(outValue)
So I tried another approach: running a PySpark script directly with spark-submit
spark-submit --master yarn-cluster hdfs://test.net:8020/user/test/scripts/testspark.py
testspark.py
from pyspark.sql.types import StringType
from pyspark import SparkConf, SparkContext
from pyspark import SQLContext
conf = SparkConf().setAppName("gveeran pyspark test")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
with open("hdfs://test.net:8020/user/test/scripts/test.hql") as fr:
    query = fr.read()
results = sqlContext.sql(query)
results.show()
But I get the same issue again, as shown below
Traceback (most recent call last):
File "PySparkTest2.py", line 7, in <module>
with open("hdfs://test.net:8020/user/test/scripts/test.hql") as fr:
IOError: [Errno 2] No such file or directory: 'hdfs://test.net:8020/user/test/scripts/test.hql'
You can read the file into a string and then execute it as a Spark SQL query.
Example:
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)
# open() needs a path on the local filesystem, not an hdfs:// URL
with open("/home/hadoop/test/abc.hql") as fr:
    query = fr.read()
    print(query)
results = sqlCtx.sql(query)
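If the script has to stay on HDFS, one option is to let Spark itself read the file instead of Python's `open()`. A minimal sketch, assuming an existing SparkSession named `spark` and a script whose statements are separated by `;`. The helper names are hypothetical, and the naive split would break on semicolons inside string literals.

```python
def split_statements(script):
    """Split a script into statements, dropping blank lines and '--' comment lines."""
    kept = [ln for ln in script.splitlines()
            if ln.strip() and not ln.strip().startswith("--")]
    parts = "\n".join(kept).split(";")
    return [p.strip() for p in parts if p.strip()]


def run_hdfs_script(spark, hdfs_path):
    """Load a script stored on HDFS through Spark and run it statement by statement."""
    # wholeTextFiles yields (path, content) pairs; a single file gives one pair.
    script = spark.sparkContext.wholeTextFiles(hdfs_path).first()[1]
    result = None
    for statement in split_statements(script):
        # spark.sql() is called once per statement in this sketch.
        result = spark.sql(statement)
    return result  # DataFrame of the last statement, or None for an empty script
```

Usage would be along the lines of `run_hdfs_script(spark, "hdfs://test.net:8020/user/test/scripts/test.hql")`, with the path from the question.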

how to load --jars with pyspark with spark standalone on client mode

I am using Python 2.7 with a Spark standalone cluster in client mode.
I want to use JDBC for MySQL and found that I need to load the driver using the --jars argument. I have the JDBC jar on my local machine and managed to load it from the pyspark console.
When I write a Python script in my IDE using pyspark, I can't load the additional jar mysql-connector-java-5.1.26.jar and keep getting a "no suitable driver" error.
How can I load additional jar files when running a Python script in client mode, using a standalone cluster and referring to a remote master?
Edit: added some code.
This is the basic code I am using. I use pyspark with a SparkContext created in Python, i.e. I do not use spark-submit directly, and I don't understand how to pass spark-submit parameters in this case...
def createSparkContext(masterAdress = algoMaster):
    """
    :return: return a spark context that is suitable for my configs
    note the ip for the master
    app name is not that important, just to show off
    """
    from pyspark.mllib.util import MLUtils
    from pyspark import SparkConf
    from pyspark import SparkContext
    import os
    SUBMIT_ARGS = "--driver-class-path /var/nfs/general/mysql-connector-java-5.1.43 pyspark-shell"
    #SUBMIT_ARGS = "--packages com.databricks:spark-csv_2.11:1.2.0 pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = SUBMIT_ARGS
    conf = SparkConf()
    #conf.set("spark.driver.extraClassPath", "var/nfs/general/mysql-connector-java-5.1.43")
    conf.setMaster(masterAdress)
    conf.setAppName('spark-basic')
    conf.set("spark.executor.memory", "2G")
    #conf.set("spark.executor.cores", "4")
    conf.set("spark.driver.memory", "3G")
    conf.set("spark.driver.cores", "3")
    #conf.set("spark.driver.extraClassPath", "/var/nfs/general/mysql-connector-java-5.1.43")
    sc = SparkContext(conf=conf)
    print sc._conf.get("spark.executor.extraClassPath")
    return sc
sql = SQLContext(sc)
df = sql.read.format('jdbc').options(url='jdbc:mysql://ip:port?user=user&password=pass', dbtable='(select * from tablename limit 100) as tablename').load()
print df.head()
Thanks
Your SUBMIT_ARGS value is passed to spark-submit when the SparkContext is created from Python. You should use --jars instead of --driver-class-path.
EDIT
Your problem is actually a lot simpler than it seems: you're missing the parameter driver in the options:
sql = SQLContext(sc)
df = sql.read.format('jdbc').options(
    url='jdbc:mysql://ip:port',
    user='user',
    password='pass',
    driver="com.mysql.jdbc.Driver",
    dbtable='(select * from tablename limit 100) as tablename'
).load()
You can also pass user and password as separate options instead of embedding them in the URL.
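Putting both suggestions together, a sketch of how the jar and the reader options might be prepared before the SparkContext is created. The helper names and the database segment in the URL are assumptions for illustration; only the `--jars` flag, the `driver` class name and the option keys come from the answer above.

```python
import os


def use_connector_jar(jar_path):
    """Ship the connector jar via --jars; call this before creating the SparkContext."""
    os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars {0} pyspark-shell".format(jar_path)


def mysql_jdbc_options(host, port, database, user, password, table):
    """Keyword options for DataFrameReader.options() on a MySQL JDBC source."""
    return {
        # The database path segment is an assumption; the original URL had none.
        "url": "jdbc:mysql://{0}:{1}/{2}".format(host, port, database),
        "driver": "com.mysql.jdbc.Driver",
        "user": user,
        "password": password,
        "dbtable": table,
    }
```

The dictionary can then be unpacked into the reader, for example `sql.read.format('jdbc').options(**mysql_jdbc_options(...)).load()`.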

load external libraries inside pyspark code

I have a Spark cluster that I use in local mode. I want to read a CSV file with the Databricks external library spark-csv. I start my app as follows:
import os
import sys
os.environ["SPARK_HOME"] = "/home/mebuddy/Programs/spark-1.6.0-bin-hadoop2.6"
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
from pyspark import SparkContext, SparkConf, SQLContext
try:
    sc
except NameError:
    print('initializing SparkContext...')
    sc=SparkContext()
sq = SQLContext(sc)
df = sq.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("/my/path/to/my/file.csv")
When I run it, I get the following error:
java.lang.ClassNotFoundException: Failed to load class for data source: com.databricks.spark.csv.
My question: how can I load the spark-csv library INSIDE my Python code? I don't want to load it from outside (using --packages, for instance).
I tried adding the following line, but it did not work:
os.environ["SPARK_CLASSPATH"] = '/home/mebuddy/Programs/spark_lib/spark-csv_2.11-1.3.0.jar'
If you create the SparkContext from scratch, you can for example set PYSPARK_SUBMIT_ARGS before the SparkContext is initialized:
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.11:1.3.0 pyspark-shell"
)
sc = SparkContext()
If for some reason you expect that the SparkContext has already been initialized, as your code suggests, this won't work. In local mode you could try to use the Py4J gateway and URLClassLoader, but it doesn't look like a good idea and won't work in cluster mode.
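A small sketch of setting the variable from inside the script while keeping any flags that are already present. The helper `add_packages` is hypothetical; it has to run before the first SparkContext is created, and it does not merge with a `--packages` flag that is already in the variable.

```python
import os


def add_packages(*coordinates):
    """Put --packages into PYSPARK_SUBMIT_ARGS; call before SparkContext exists."""
    current = os.environ.get("PYSPARK_SUBMIT_ARGS", "pyspark-shell")
    # Keep any existing flags and make sure 'pyspark-shell' stays last.
    flags = current.replace("pyspark-shell", "").strip()
    packages = "--packages " + ",".join(coordinates)
    os.environ["PYSPARK_SUBMIT_ARGS"] = " ".join(
        part for part in (flags, packages, "pyspark-shell") if part
    )
    return os.environ["PYSPARK_SUBMIT_ARGS"]
```

For the question's case this would be `add_packages("com.databricks:spark-csv_2.11:1.3.0")` followed by `sc = SparkContext()`.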
