How to execute hql script with transform python udf in spark?

I am new to Spark and am learning through a POC. As part of this POC, I am trying to directly execute an HQL file that uses the TRANSFORM keyword to call a Python UDF.
I have tested the HQL script in the CLI with "hive -f filename.hql" and it works fine.
I tried the same script in spark-sql, but it fails with an HDFS "path not found" error. I tried giving the HDFS path in different ways, as below, but none of them work:
"/test/scripts/test.hql"
"hdfs://test.net:8020/test/scripts/test.hql"
"hdfs:///test.net:8020/test/scripts/test.hql"
I also tried giving the complete path in the Hive transform code, as below:
USING "scl enable python27 'python hdfs://test.net:8020/user/test/scripts/TestPython.py'"
Hive code:
add file hdfs://test.net:8020/user/test/scripts/TestPython.py;
select * from
(select transform (*)
USING "scl enable python27 'python TestPython.py'"
as (Col_1 STRING,
col_2 STRING,
...
..
col_125 STRING
)
FROM
test.transform_inner_temp1 a) b;
TestPython code:
#!/usr/bin/env python
'''
Created on June 2, 2017
#author: test
'''
import sys
from datetime import datetime
import decimal
import string
D = decimal.Decimal
for line in sys.stdin:
    line = sys.stdin.readline()
    TempList = line.strip().split('\t')
    col_1 = TempList[0]
    ...
    ....
    col_125 = TempList[34] + TempList[32]
    outList.extend((col_1,....col_125))
    outValue = "\t".join(map(str,outList))
    print "%s"%(outValue)
So I tried another method: executing the script directly with spark-submit.
spark-submit --master yarn-cluster hdfs://test.net:8020/user/test/scripts/testspark.py
testspark.py
from pyspark.sql.types import StringType
from pyspark import SparkConf, SparkContext
from pyspark import SQLContext
conf = SparkConf().setAppName("gveeran pyspark test")
sc = SparkContext(conf=conf)
sqlContext = SQLContext(sc)
with open("hdfs://test.net:8020/user/test/scripts/test.hql") as fr:
    query = fr.read()
results = sqlContext.sql(query)
results.show()
But I get the same issue again, as below:
Traceback (most recent call last):
  File "PySparkTest2.py", line 7, in <module>
    with open("hdfs://test.net:8020/user/test/scripts/test.hql") as fr:
IOError: [Errno 2] No such file or directory: 'hdfs://test.net:8020/user/test/scripts/test.hql'

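You can read the file into a string and then execute it as a Spark SQL job. Note that Python's open() reads only from the local filesystem, so the script must be on a local path rather than at an hdfs:// URL.
Example:
from pyspark import SparkContext
from pyspark.sql import SQLContext
sc = SparkContext.getOrCreate()
sqlCtx = SQLContext(sc)
# open() takes a local path; it cannot read an hdfs:// URL
with open("/home/hadoop/test/abc.hql") as fr:
    query = fr.read()
print(query)
results = sqlCtx.sql(query)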
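
Two details in the question go beyond reading a local file: the script lives in HDFS, and it contains more than one statement (an "add file ...;" followed by the select), while sql() takes a single statement per call. The following is a minimal sketch under those assumptions, not a definitive implementation. The helper names split_statements, read_hdfs_text and run_hql are hypothetical, the splitter does not handle escaped quotes or SQL comments, and on Spark 2.x the TRANSFORM clause is expected to need a session created with Hive support enabled.

```python
def split_statements(script):
    """Split an HQL script on ';', ignoring semicolons inside quoted strings."""
    statements, current, quote = [], [], None
    for ch in script:
        if quote:
            # Inside a string literal: only the matching quote closes it.
            if ch == quote:
                quote = None
        elif ch in ("'", '"'):
            quote = ch
        elif ch == ";":
            statements.append("".join(current).strip())
            current = []
            continue
        current.append(ch)
    statements.append("".join(current).strip())
    return [s for s in statements if s]


def read_hdfs_text(spark, path):
    """Read one text file through Spark, which (unlike open()) understands hdfs:// paths."""
    # wholeTextFiles yields (path, content) pairs; keep the content of the single file.
    return spark.sparkContext.wholeTextFiles(path).values().first()


def run_hql(spark, script):
    """Run each statement in order and return the result of the last one."""
    result = None
    for statement in split_statements(script):
        result = spark.sql(statement)
    return result
```

Assuming a SparkSession named spark, this could be called as run_hql(spark, read_hdfs_text(spark, "hdfs://test.net:8020/user/test/scripts/test.hql")).show().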

Related

Unable to load S3-hosted CSV into Spark Dataframe on Jupyter Notebook

Unable to load S3-hosted CSV into Spark Dataframe on Jupyter Notebook.
I believe I uploaded the 2 required packages with the os.environ line below. If I did it incorrectly please show me how to correctly install it. The Jupyter Notebook is hosted on an EC2 instance, which is why I'm trying to pull the CSV from a S3 bucket.
Here is my code:
import os
import pyspark
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell'
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()
spark
Then I do:
%%time
df = spark.read.csv(f"s3://{AWS_BUCKET}/prices/{ts}_all_prices_{user}.csv", inferSchema = True, header = True)
And I get this error:
WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: s3://blah-blah-blah/prices/1655999784.356597_blah_blah_blah.csv.
org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"

Here is an example using s3a.
import os
import pyspark
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk-pom:1.10.34,org.apache.hadoop:hadoop-aws:2.7.2 pyspark-shell'
from pyspark.sql import SQLContext
from pyspark import SparkContext
sc = SparkContext()
sqlContext = SQLContext(sc)
filePath = "s3a://yourBucket/yourFile.parquet"
df = sqlContext.read.parquet(filePath) # Parquet file read example
Here is a more complete example that builds the Spark session with the required config.
from pyspark.sql import SparkSession
def get_spark():
    spark = SparkSession.builder.master("local[4]").appName('SparkDelta') \
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
        .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
        .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
        .config("spark.jars.packages",
                "io.delta:delta-core_2.12:1.1.0,"
                "org.apache.hadoop:hadoop-aws:3.2.2,"
                "com.amazonaws:aws-java-sdk-bundle:1.12.180") \
        .getOrCreate()
    # This config is mandatory on the Spark session to use AWS S3
    spark._jsc.hadoopConfiguration().set("com.amazonaws.services.s3.enableV4", "true")
    spark._jsc.hadoopConfiguration().set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    spark._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider","com.amazonaws.auth.InstanceProfileCredentialsProvider,com.amazonaws.auth.DefaultAWSCredentialsProviderChain")
    spark._jsc.hadoopConfiguration().set("fs.AbstractFileSystem.s3a.impl", "org.apache.hadoop.fs.s3a.S3A")
    # spark.sparkContext.setLogLevel("DEBUG")
    return spark

Try using s3a as the protocol:
df = spark.read.csv(f"s3a://{AWS_BUCKET}/prices/{ts}_all_prices_{user}.csv", inferSchema = True, header = True)
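
If paths reach the notebook with an s3:// prefix (from a config file or another tool, say), a small helper can normalize them before they are handed to spark.read. This is only a sketch: the name to_s3a is hypothetical, and the bucket and file name in the demonstration are placeholders standing in for AWS_BUCKET, ts and user from the question.

```python
def to_s3a(uri):
    """Rewrite s3:// or s3n:// URIs to the s3a:// scheme; leave anything else untouched."""
    for scheme in ("s3://", "s3n://"):
        if uri.startswith(scheme):
            return "s3a://" + uri[len(scheme):]
    return uri


# Placeholder path, not a real bucket.
path = to_s3a("s3://my-bucket/prices/1655999784_all_prices_alice.csv")
print(path)
```

The result would then be passed on unchanged, for example spark.read.csv(path, inferSchema=True, header=True).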

How to load sql file in spark using python

My PySpark version is 2.4 and my Python version is 2.7. I have a multi-line SQL file that needs to run in Spark. Instead of running it line by line, is it possible to read the SQL file in the Python script (which initializes Spark) and execute it using spark-submit? I am trying to write a generic Python script so that later we only need to replace the SQL file in the HDFS folder. Below is my code snippet.
import sys
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
args = str(sys.argv[1]).split(',')
fd = args[0]
ld = args[1]
sd = args[2]
#Below line does not work
df = open("test.sql")
query = df.read().format(fd,ld,sd)
#Initiating SparkSession.
spark = SparkSession.builder.appName("PC").enableHiveSupport().getOrCreate()
#Below line works fine
df_s=spark.sql("""select * from test_tbl where batch_date='2021-08-01'""")
#Execute the sql (Does not work now)
df_s1=spark.sql(query)
spark-submit throws the following error for the above code:
Exception in thread "main" org.apache.spark.SparkException:
Application application_1643050700073_7491 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1158)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1606)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:847)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:922)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:931)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
22/02/10 01:24:52 INFO util.ShutdownHookManager: Shutdown hook called
I am relatively new to PySpark. Can anyone please tell me what I am missing here?

Python's open() cannot read a file that lives in HDFS; it only sees the local filesystem. If you want to run a SQL statement stored in a file in HDFS, you first have to copy that file from HDFS to your local directory.
As described in the Spark 2.4.0 documentation, you can then simply use the PySpark API:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PC").enableHiveSupport().getOrCreate()
spark.sql("YOUR QUERY").show()
or query files directly with:
df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
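
For the generic script the question describes, the file handling can be kept in one small function. The sketch below assumes the SQL file is readable on the local filesystem of the driver and uses positional {0}, {1}, ... placeholders, as in the question's .format(fd,ld,sd) call; the name load_query is hypothetical. The stack trace in the question shows a YARN cluster-mode submission, where the driver runs on a cluster node, so the file would have to be present there too (shipping it with spark-submit --files test.sql is one common approach). SQL that contains literal braces would need them doubled for str.format.

```python
import os
import tempfile


def load_query(path, *params):
    """Read a SQL template from a local file and fill in positional placeholders."""
    with open(path) as handle:
        template = handle.read()
    # str.format replaces {0}, {1}, ... with the given parameters.
    return template.format(*params)


# Self-contained demonstration: a temporary file stands in for test.sql.
with tempfile.NamedTemporaryFile("w", suffix=".sql", delete=False) as tmp:
    tmp.write("select * from test_tbl\nwhere batch_date between '{0}' and '{1}'")
query = load_query(tmp.name, "2021-08-01", "2021-08-31")
os.remove(tmp.name)
print(query)
```

In the question's script this would become df_s1 = spark.sql(load_query("test.sql", fd, ld, sd)).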

Getting objects from S3 bucket using PySpark

I'm trying to get JSON objects from an S3 bucket using PySpark (on Windows, using wsl2 terminal).
I can do this using boto3 as an intermediate step but, when I try to use the spark.read.json method, I get an error.
Code:
import findspark
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
import os
import multiprocessing
#----------------APACHE CONFIGURATIONS--------------
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
#---------------spark--------------
conf = (
    SparkConf()
    .set('spark.executor.extraJavaOptions','-Dcom.amazonaws.services.s3.enableV4=true')
    .set('spark.driver.extraJavaOptions','-Dcom.amazonaws.services.s3.enableV4=true')
    .setAppName('pyspark_aws')
    .setMaster(f"local[{multiprocessing.cpu_count()}]")
    .setIfMissing("spark.executor.memory", "2g")
)
sc=SparkContext(conf=conf)
sc.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')
spark=SparkSession(sc)
#--------------hadoop--------------
accessKeyId='xxxxxxxxxxxx'
secretAccessKey='xxxxxxxxx'
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', accessKeyId)
hadoopConf.set('fs.s3a.secret.key', secretAccessKey)
hadoopConf.set('fs.s3a.endpoint', 's3-eu-west-1.amazonaws.com')
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
hadoopConf.set('fs.s3a.multipart.size', '419430400')
hadoopConf.set('fs.s3a.multipart.threshold', '2097152000')
hadoopConf.set('fs.s3a.connection.maximum', '500')
hadoopConf.set('s3a.connection.timeout', '600000')
s3_df = spark.read.json('s3a://{bucket}/{directory}/{object}.json')
Error:
py4j.protocol.Py4JJavaError: An error occurred while call
: java.lang.NumberFormatException: For input string: "32M
at java.base/java.lang.NumberFormatException.forI
at java.base/java.lang.Long.parseLong(Long.java:6
at java.base/java.lang.Long.parseLong(Long.java:8
at org.apache.hadoop.conf.Configuration.getLong(C
at org.apache.hadoop.fs.s3a.S3AFileSystem.getDefa
at org.apache.hadoop.fs.FileSystem.getDefaultBloc
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFile
at org.apache.hadoop.fs.s3a.S3AFileSystem.getFile
at org.apache.hadoop.fs.FileSystem.exists(FileSys
at org.apache.spark.sql.execution.datasources.Dat
at org.apache.spark.sql.execution.datasources.Dat
at org.apache.spark.util.ThreadUtils$.$anonfun$pa
at java.base/java.util.concurrent.ForkJoinTask$RunnableExecuteAction.exec(ForkJoinTask.java:1426)
at java.base/java.util.concurrent.ForkJoinTask.do
at java.base/java.util.concurrent.ForkJoinPool$Wor
at java.base/java.util.concurrent.ForkJoinPool.sc
at java.base/java.util.concurrent.ForkJoinPool.ru
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:183)
I added the multipart.size, multipart.threshold, connection.maximum, connection.timeout hadoop conf settings when I was getting a similar error earlier (this earlier error had '64M' instead of '32M' and changed when I added these conf settings)
I'm new to Spark so any and all tips/pointers would be helpful!

The "32M" is the default value of fs.s3a.block.size. The stack trace shows Configuration.getLong failing to parse it as a number, so try setting the value explicitly in bytes:
hadoopConf.set('fs.s3a.block.size', '33554432')
The explanations of the "32M" and the "64M" defaults are at https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html
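
The byte value above is simply 32 * 1024 * 1024. When several size settings have to be written out as plain numbers, a small converter avoids arithmetic slips. This is a sketch with a hypothetical helper name, size_to_bytes, and it assumes the suffixes denote binary multiples.

```python
def size_to_bytes(size):
    """Convert strings such as '32M' or '64M' to a byte count using binary multiples."""
    units = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
    text = size.strip().upper()
    if text and text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    # No recognized suffix: treat the value as a plain byte count.
    return int(text)


print(size_to_bytes("32M"))  # 33554432
print(size_to_bytes("64M"))  # 67108864
```

It could be used as hadoopConf.set('fs.s3a.block.size', str(size_to_bytes('32M'))).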

load external libraries inside pyspark code

I have a Spark cluster that I use in local mode. I want to read a CSV with the external Databricks library spark-csv. I start my app as follows:
import os
import sys
os.environ["SPARK_HOME"] = "/home/mebuddy/Programs/spark-1.6.0-bin-hadoop2.6"
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
from pyspark import SparkContext, SparkConf, SQLContext
try:
    sc
except NameError:
    print('initializing SparkContext...')
    sc=SparkContext()
sq = SQLContext(sc)
df = sq.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("/my/path/to/my/file.csv")
When I run it, I get the following error:
java.lang.ClassNotFoundException: Failed to load class for data source: com.databricks.spark.csv.
My question: how can I load the com.databricks.spark.csv library INSIDE my Python code? I don't want to load it from outside (using --packages, for instance).
I tried adding the following line, but it did not work:
os.environ["SPARK_CLASSPATH"] = '/home/mebuddy/Programs/spark_lib/spark-csv_2.11-1.3.0.jar'

If you create the SparkContext from scratch, you can, for example, set PYSPARK_SUBMIT_ARGS before the SparkContext is initialized:
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.11:1.3.0 pyspark-shell"
)
sc = SparkContext()
If for some reason you expect that the SparkContext has already been initialized, as your code suggests, this won't work. In local mode you could try to use the Py4J gateway and URLClassLoader, but that doesn't look like a good idea and won't work in cluster mode.
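
When more than one package is needed, the argument string can be assembled from a list. The sketch below uses a hypothetical helper, build_submit_args, and only covers the --packages option; the assignment has to happen before the first SparkContext is created in the process.

```python
import os


def build_submit_args(packages):
    """Build a PYSPARK_SUBMIT_ARGS value from a list of Maven coordinates."""
    # The value ends with "pyspark-shell", as in the answer above.
    return "--packages {} pyspark-shell".format(",".join(packages))


# Must run before the SparkContext (and its JVM) is created.
os.environ["PYSPARK_SUBMIT_ARGS"] = build_submit_args(
    ["com.databricks:spark-csv_2.11:1.3.0"]
)
print(os.environ["PYSPARK_SUBMIT_ARGS"])
```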

Apache pyspark using oracle jdbc to pull data. Driver cannot be found

I am using Apache Spark with PySpark (spark-1.5.2-bin-hadoop2.6) on Windows 7.
I keep getting this error when I run my Python script in PySpark:
An error occurred while calling o23.load. java.sql.SQLException: No suitable driver found for jdbc:oracle:thin:------------------------------------connection
Here is my Python file:
import os
os.environ["SPARK_HOME"] = "C:\\spark-1.5.2-bin-hadoop2.6"
os.environ["SPARK_CLASSPATH"] = "L:\\Pyspark_Snow\\ojdbc6.jar"
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
spark_config = SparkConf().setMaster("local[8]")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
df = (sqlContext
      .load(source="jdbc",
            url="jdbc:oracle:thin://x.x.x.x/xdb?user=xxxxx&password=xxxx",
            dbtable="x.users")
      )
sc.stop()

Unfortunately, changing the environment variable SPARK_CLASSPATH won't work. You need to declare
spark.driver.extraClassPath L:\\Pyspark_Snow\\ojdbc6.jar
in your /path/to/spark/conf/spark-defaults.conf, or simply run the spark-submit job with the additional --jars argument:
spark-submit --jars "L:\\Pyspark_Snow\\ojdbc6.jar" yourscript.py

To set the jars programmatically, set the following config:
spark.yarn.dist.jars, with a comma-separated list of jars.
For example:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName("Spark config example") \
    .config("spark.yarn.dist.jars", "<path-to-jar/test1.jar>,<path-to-jar/test2.jar>") \
    .getOrCreate()
Or as below:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
spark_config = SparkConf().setMaster("local[8]")
spark_config.set("spark.yarn.dist.jars", "L:\\Pyspark_Snow\\ojdbc6.jar")
sc = SparkContext(conf=spark_config)
sqlContext = SQLContext(sc)
Or pass --jars to spark-submit, with the paths of the jar files separated by commas.

You can also add the jar using --jars and --driver-class-path and then set the driver class explicitly. See https://stackoverflow.com/a/36328672/1547734
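
Setting the driver explicitly can be sketched as follows, assuming the Oracle jar is already on the classpath through one of the options above. The helper names oracle_jdbc_options and read_table are hypothetical, the connection details are placeholders rather than values from the question, and oracle.jdbc.OracleDriver is the class name assumed for the ojdbc jar.

```python
def oracle_jdbc_options(host, port, service, table, user, password):
    """Collect JDBC reader options, naming the driver class explicitly."""
    return {
        "url": "jdbc:oracle:thin:@//{}:{}/{}".format(host, port, service),
        "dbtable": table,
        "user": user,
        "password": password,
        # With this option Spark loads the named class itself
        # instead of relying on automatic driver discovery.
        "driver": "oracle.jdbc.OracleDriver",
    }


def read_table(sql_context, options):
    """Load a table through the JDBC data source of a SQLContext or SparkSession."""
    return sql_context.read.format("jdbc").options(**options).load()


# Placeholder connection details, not taken from the question.
opts = oracle_jdbc_options("dbhost", 1521, "xdb", "x.users", "scott", "secret")
print(opts["url"])
```

With the question's sqlContext, the read would then be df = read_table(sqlContext, opts).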
