multiple SparkContexts error in tutorial

My Zeppelin notebook is currently using local Spark. When I tried to create a remote SparkContext, I got ValueError: Cannot run multiple SparkContexts at once.

I wrote the code below:
from pyspark import SparkConf, SparkContext
sc.stop()
conf = SparkConf().setAppName('train_etl').setMaster('spark://xxxx:7077')
sc = SparkContext(conf=conf)
Got another error:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6681108227268089746.py", line 363, in <module>
sc.setJobGroup(jobGroup, jobDesc)
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 944, in setJobGroup
self._jsc.setJobGroup(groupId, description, interruptOnCancel)
AttributeError: 'NoneType' object has no attribute 'setJobGroup'
What should I do?
By default, Spark automatically creates the SparkContext object named sc when the PySpark application starts, so you should reuse it rather than construct a new one. Use the line below in your code:
sc = SparkContext.getOrCreate()
Similarly, SQLContext.getOrCreate gets the singleton SQLContext if it exists or creates a new one using the given SparkContext. It returns a singleton SQLContext object that can be shared across the JVM; if there is an active SQLContext for the current thread, that one is returned instead of the global one.
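As a minimal sketch of how the two singletons are typically obtained in a Zeppelin paragraph (assuming the interpreter has already started a context, as it does by default):

from pyspark import SparkContext
from pyspark.sql import SQLContext

# Reuse the SparkContext that Zeppelin already created instead of building a new one.
sc = SparkContext.getOrCreate()

# Reuse (or lazily create) the shared SQLContext for this JVM.
sqlContext = SQLContext.getOrCreate(sc)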
Alternatively, open http://zeppelin_host:zeppelin_port/#/interpreter and set the master property of the Spark interpreter (which PySpark also uses) to spark://xxxx:7077.
My PySpark version is 2.4 and my Python version is 2.7. I have a multi-line SQL file that needs to run in Spark. Instead of running it line by line, is it possible to keep the SQL file with the Python script (which initializes Spark) and execute it using spark-submit? I am trying to write a generic script in Python so that we only need to replace the SQL file in the HDFS folder later. Below is my code snippet.
import sys
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
args = str(sys.argv[1]).split(',')
fd = args[0]
ld = args[1]
sd = args[2]
#Below line does not work
df = open("test.sql")
query = df.read().format(fd,ld,sd)
#Initiating SparkSession.
spark = SparkSession.builder.appName("PC").enableHiveSupport().getOrCreate()
#Below line works fine
df_s=spark.sql("""select * from test_tbl where batch_date='2021-08-01'""")
#Execute the sql (Does not work now)
df_s1=spark.sql(query)
spark-submit throws the following error for the above code.
Exception in thread "main" org.apache.spark.SparkException:
Application application_1643050700073_7491 finished with failed status
at org.apache.spark.deploy.yarn.Client.run(Client.scala:1158)
at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1606)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:847)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:161)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:184)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:922)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:931)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
22/02/10 01:24:52 INFO util.ShutdownHookManager: Shutdown hook called
I am relatively new to PySpark. Can anyone please guide me on what I am missing here?
You can't open a file in HDFS with Python's built-in open(); it only reads from the local filesystem. If you want to run a SQL statement kept in a file in HDFS, you first have to copy that file from HDFS to your local directory (or read it through Spark).
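Reading the file through Spark itself avoids the local copy. Here is a rough sketch, assuming the .sql file lives at a hypothetical HDFS path and uses the {0}/{1}/{2} placeholders that match the str.format call in the question:

import sys
from pyspark.sql import SparkSession

# Positional parameters passed on the command line, as in the question.
fd, ld, sd = str(sys.argv[1]).split(',')

spark = SparkSession.builder.appName("PC").enableHiveSupport().getOrCreate()

# wholeTextFiles returns (path, content) pairs, so this works for an HDFS path
# as well as a local one; the path below is only a placeholder.
sql_text = spark.sparkContext.wholeTextFiles("hdfs:///tmp/test.sql").first()[1]

# Fill in the {0}, {1}, {2} placeholders with the command-line values.
query = sql_text.format(fd, ld, sd)

# spark.sql runs one statement at a time, so split on ';' if the file holds several.
for stmt in [s.strip() for s in query.split(';') if s.strip()]:
    spark.sql(stmt).show()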
Referring to the Spark 2.4.0 documentation, you can simply use the PySpark SQL API:
from os.path import expanduser, join, abspath
from pyspark.sql import SparkSession
from pyspark.sql import Row

# Create (or reuse) the session before issuing SQL.
spark = SparkSession.builder.appName("PC").enableHiveSupport().getOrCreate()

spark.sql("YOUR QUERY").show()
or query files directly with:
df = spark.sql("SELECT * FROM parquet.`examples/src/main/resources/users.parquet`")
I'm trying to run the latest version of boto3 in an AWS Glue spark job to access methods that aren't available in the default version in Glue.
To get the default version of boto3 and verify that the method I want to access isn't available, I run this block of code, which is all boilerplate except for my print statements:
import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
athena = boto3.client('athena')
print(boto3.__version__) # verify the default version boto3 imports
print(athena.list_table_metadata) # method I want to verify I can access in Glue
job.commit()
which returns
1.12.4
Traceback (most recent call last):
  File "/tmp/another_sample", line 20, in <module>
    print(athena.list_table_metadata)
  File "/home/spark/.local/lib/python3.7/site-packages/botocore/client.py", line 566, in __getattr__
    self.__class__.__name__, item)
AttributeError: 'Athena' object has no attribute 'list_table_metadata'
Ok, as expected with an older version of boto3. Let's try and import the latest version...
I perform the following steps:
Go to https://pypi.org/project/boto3/#files
Download the boto3-1.17.13-py2.py3-none-any.whl file
Place it in S3 location
Go back to the Glue job and, under the Security configuration, script libraries, and job parameters (optional) section, update the Python library path with the S3 location from step 3
Rerun block of code from above
which returns
1.17.9
Traceback (most recent call last):
  File "/tmp/another_sample", line 20, in <module>
    print(athena.list_table_metadata)
  File "/home/spark/.local/lib/python3.7/site-packages/botocore/client.py", line 566, in __getattr__
    self.__class__.__name__, item)
AttributeError: 'Athena' object has no attribute 'list_table_metadata'
If I run this same script locally, which is running 1.17.9 I can find the method:
1.17.9
<bound method ClientCreator._create_api_method.<locals>._api_call of <botocore.client.Athena object at 0x7efd8a4f4710>>
Any ideas on what's going on here and how to access the methods that I would expect should be imported in the upgraded version?
Ended up finding a work-around solution in the AWS documentation.
Added the following Key/Value pair in the Glue Job parameters under the Security configuration, script libraries, and job parameters (optional) section of the job:
Key:
--additional-python-modules
Value:
botocore>=1.20.12,boto3>=1.17.12
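The same parameter can also be passed programmatically when starting a run. A small sketch using boto3's Glue client (the job name is hypothetical):

import boto3

glue = boto3.client('glue')

# --additional-python-modules tells Glue to pip-install the listed packages
# before the job script runs.
glue.start_job_run(
    JobName='my-glue-job',
    Arguments={'--additional-python-modules': 'botocore>=1.20.12,boto3>=1.17.12'},
)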
Unable to create a Spark session

When I create a Spark session using PySpark, it throws an error. Here is the code snippet and traceback:
ValueError Traceback (most recent call last)
<ipython-input-13-2262882856df> in <module>()
37 if __name__ == "__main__":
38 conf = SparkConf()
---> 39 sc = SparkContext(conf=conf)
40 # print(sc.version)
41 # sc = SparkContext(conf=conf)
~/anaconda3/lib/python3.5/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
131 " note this option will be removed in Spark 3.0")
132
--> 133 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
134 try:
135 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
~/anaconda3/lib/python3.5/site-packages/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
330 " created by %s at %s:%s "
331 % (currentAppName, currentMaster,
--> 332 callsite.function, callsite.file, callsite.linenum))
333 else:
334 SparkContext._active_spark_context = instance
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=local[*]) created by __init__ at <ipython-input-7-edf43bdce70a>:33
Imports:
from pyspark import SparkConf, SparkContext
I tried this alternative approach, which fails too:
spark = SparkSession(sc).builder.appName("Detecting-Malicious-URL App").getOrCreate()
This is throwing another error as follows:
NameError: name 'SparkSession' is not defined
Try this -
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Detecting-Malicious-URL App").getOrCreate()
Before Spark 2.0 we had to create a SparkConf and SparkContext to interact with Spark. In Spark 2.0, SparkSession is the entry point to Spark SQL; we no longer need to create SparkConf, SparkContext, or SQLContext separately, as they are encapsulated within the SparkSession.
Please refer to this blog for more details: How to use SparkSession in Apache Spark 2.0
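If you still need the low-level context (for RDD APIs), it is reachable from the session, so nothing extra has to be constructed. A minimal sketch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Detecting-Malicious-URL App").getOrCreate()

# The SparkContext created (or reused) by the session is exposed as an attribute.
sc = spark.sparkContext
print(sc.master)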
SparkContext is used to connect to the cluster through a resource manager. SparkConf is required to create the SparkContext object; it stores configuration parameters such as appName (to identify your Spark driver), the number of cores, and the memory size of the executors running on the worker nodes. To use the SQL, Hive, or Streaming APIs, separate contexts had to be created.
SparkSession, on the other hand, provides a single point of entry to the underlying Spark functionality and lets you program Spark with DataFrames and the higher-level APIs. You don't need to create a separate context or session to use SQL, Hive, etc.
To create a SparkSession you might use the following builder:
spark = SparkSession.builder.master("local") \
    .appName("Detecting-Malicious-URL App") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
To overcome the error
"NameError: name 'SparkSession' is not defined"
you need to import it first:
"from pyspark.sql import SparkSession"
pyspark.sql provides the SparkSession, which is used to create DataFrames, register DataFrames as tables, and so on.
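For example, a small illustrative sketch (the sample rows are made up) showing the DataFrame-to-SQL round trip through a SparkSession:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Detecting-Malicious-URL App").getOrCreate()

# Hypothetical sample data, just to show createDataFrame and a temp view.
df = spark.createDataFrame(
    [("http://example.com", 0), ("http://bad.example", 1)],
    ["url", "label"])
df.createOrReplaceTempView("urls")

spark.sql("SELECT url FROM urls WHERE label = 1").show()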
As for the other error you mentioned (ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=local[*]) created by __init__ at <ipython-input-7-edf43bdce70a>:33), this question might be helpful: ValueError: Cannot run multiple SparkContexts at once in spark with pyspark
I faced a similar problem on Windows 10 and solved it in the following way: go to the conda prompt and run the following command:
# Install OpenJDK 11
conda install openjdk
Then run:
SparkSession.builder.appName('...').getOrCreate()
I am trying to integrate Spark with Kafka. I have a Kafka consumer with JSON data, and I want to integrate it with Spark for processing. When I run the code below, an error is thrown.
bin\spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 test.py localhost:9092 maktest
My test.py is below
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    print (lines)
    ssc.start()
    ssc.awaitTermination()
I got the error below
18/12/10 16:41:40 INFO VerifiableProperties: Verifying properties
18/12/10 16:41:40 INFO VerifiableProperties: Property group.id is overridden to
18/12/10 16:41:40 INFO VerifiableProperties: Property zookeeper.connect is overridden to
<pyspark.streaming.kafka.KafkaTransformedDStream object at 0x000002A6DA9FE6A0>
18/12/10 16:41:40 ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:224)
Traceback (most recent call last):
File "C:/Users/maws/Desktop/spark-2.2.1-bin-hadoop2.7/test.py", line 12, in <module>
ssc.start()
py4j.protocol.Py4JJavaError: An error occurred while calling o25.start.
: java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
You're not using a supported Spark Streaming DStream output operation.
For the pyspark API, you should use:
pprint()
saveAsTextFiles(prefix, [suffix])
saveAsObjectFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
foreachRDD(func)
print() can't be used as a DStream output operation in PySpark, so when you follow Spark Streaming examples written for Scala or Java, make sure you change print() to pprint(); see the corrected sketch below.
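A sketch of the question's test.py with print(lines) replaced by the pprint() output operation (the broker and topic still come from the command line, as before):

import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)

    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])

    # Register an output operation on the DStream; printing the DStream object
    # itself does not count as one, which is what caused the error.
    lines.pprint()

    ssc.start()
    ssc.awaitTermination()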
I run the following file (called test_func.py) with py.test:
import findspark
findspark.init()
from pyspark.context import SparkContext
def filtering(data):
    return data.map(lambda p: modif(p)).count()

def modif(row):
    row.split(",")

class Test(object):
    sc = SparkContext('local[1]')

    def test_filtering(self):
        data = self.sc.parallelize(['1', '2', ''])
        assert filtering(data) == 2
And, because the modif function is used inside the map transform, it fails with the following error:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/osboxes/spark-1.5.2-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/home/osboxes/spark-1.5.2-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/home/osboxes/spark-1.5.2-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
ImportError: No module named clustering.test_func
PySpark does not manage to find the modif function. Note that the file test_func.py is in the clustering directory and I run py.test from inside the clustering directory.
What astonishes me is that if I use modif function outside a map, it works fine. For instance, if I do: modif(data.first())
Any idea why I get such importErrors and how I could fix it?
EDIT
I had already tested what was proposed in Avihoo Mamka's answer, i.e. adding test_func.py to the files copied to all workers. However, it had no effect. That is no surprise to me, because I think the main file, from which the Spark application is created, is always sent to all workers.
I think it can come from the fact that pyspark is looking for clustering.test_func and not test_func.
The key here is the Traceback you got.
PySpark is telling you that the worker process doesn't have access to clustering.test_func.py. When you initialize the SparkContext you can pass a list of files that should be copied to the worker:
sc = SparkContext("local[1]", "App Name", pyFiles=['MyFile.py', 'lib.zip', 'app.egg'])
More information: https://spark.apache.org/docs/1.5.2/programming-guide.html
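If the context is already created before you know which files the workers will need, the same effect can be had after the fact with addPyFile. A minimal sketch (the relative path assumes py.test is launched from the clustering directory):

from pyspark.context import SparkContext

sc = SparkContext('local[1]')

# Ship the test module to the worker processes so functions defined in it,
# such as modif, can be unpickled there.
sc.addPyFile('test_func.py')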