Run kafkastream with JAR artifact in Jupyter - python

I am working on a simple Python script to stream messages from Kafka using pyspark, and I'm doing so in Jupyter.
I get an error message saying Spark Streaming's Kafka libraries were not found in the class path (more details below). I included the solution suggested by #tshilidzi-mudau in a previous post (and confirmed in the docs) to avoid this problem. What should I do to fix the bug?
Following what the error message suggests, I downloaded the JAR of the artifact, stored it in $SPARK_HOME/jars, and included the reference in the code.
Here is the code:
import os
from __future__ import print_function
import sys
from pyspark.streaming import StreamingContext
from pyspark import SparkContext, SparkConf
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    # note that the "pyspark-shell" part is very important!
    os.environ['PYSPARK_SUBMIT_ARGS'] = '--jars spark-streaming-kafka-0-10-assembly_2.10-2.2.2.jar pyspark-shell'
    #conf = SparkConf().setAppName("Kafka-Spark").setMaster("spark://127.0.0.1:7077")
    conf = SparkConf().setAppName("Kafka-Spark")
    #sc = SparkContext(appName="KafkaSpark")
    try:
        sc.stop()
    except:
        pass
    sc = SparkContext(conf=conf)
    stream = StreamingContext(sc, 1)
    map1 = {'spark-kafka': 1}
    kafkaStream = KafkaUtils.createStream(stream, 'localhost:9092', "name", map1)  # tried with localhost:2181 too
    print("kafkastream=", kafkaStream)
    sc.stop()
And this is the error:
Spark Streaming's Kafka libraries not found in class path. Try one of the following.
1. Include the Kafka library and its dependencies with in the
spark-submit command as
$ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.2.2 ...
2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.2.2.
Then, include the jar in the spark-submit command as
$ bin/spark-submit --jars <spark-streaming-kafka-0-8-assembly.jar> ...
TypeError Traceback (most recent call last)
<ipython-input-9-34de7dbdfc7c> in <module>()
13 ssc = StreamingContext(sc,1)
14 broker = "<my_broker_ip>"
---> 15 directKafkaStream = KafkaUtils.createDirectStream(ssc, ["test1"], {"metadata.broker.list": broker})
16 directKafkaStream.pprint()
17 ssc.start()
/opt/spark/python/pyspark/streaming/kafka.pyc in createDirectStream(ssc, topics, kafkaParams, fromOffsets, keyDecoder, valueDecoder, messageHandler)
120 return messageHandler(m)
121
--> 122 helper = KafkaUtils._get_helper(ssc._sc)
123
124 jfromOffsets = dict([(k._jTopicAndPartition(helper),
/opt/spark/python/pyspark/streaming/kafka.pyc in _get_helper(sc)
193 def _get_helper(sc):
194 try:
--> 195 return sc._jvm.org.apache.spark.streaming.kafka.KafkaUtilsPythonHelper()
196 except TypeError as e:
197 if str(e) == "'JavaPackage' object is not callable":
TypeError: 'JavaPackage' object is not callable
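Note that the error text above points at the spark-streaming-kafka-0-8 artifact, while the code sets --jars with the 0-10 assembly. A minimal sketch of launching against the 0-8 package instead (an assumption on my part; the _2.11 suffix presumes a Scala 2.11 build of Spark 2.2.2, so adjust it to your build) might look like this:

import os

# Sketch only: request the 0-8 integration named in the error message via
# --packages (resolved from Maven) before any SparkContext exists.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.2 pyspark-shell'
)

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(conf=SparkConf().setAppName("Kafka-Spark"))
ssc = StreamingContext(sc, 1)

# createStream takes the ZooKeeper quorum (usually port 2181), not the broker list.
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', "name", {'spark-kafka': 1})

The PYSPARK_SUBMIT_ARGS assignment has to happen before the first SparkContext is created in the notebook kernel, otherwise the extra packages are never picked up.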

Related

How to run an AWS Glue Python Spark Job with the Current Version of boto3?

I'm trying to run the latest version of boto3 in an AWS Glue spark job to access methods that aren't available in the default version in Glue.
To get the default version of boto3 and verify that the method I want to access isn't available, I run this block of code, which is all boilerplate except for my print statements:
import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
athena = boto3.client('athena')
print(boto3.__version__) # verify the default version boto3 imports
print(athena.list_table_metadata) # method I want to verify I can access in Glue
job.commit()
which returns
1.12.4
Traceback (most recent call last):
  File "/tmp/another_sample", line 20, in <module>
    print(athena.list_table_metadata)
  File "/home/spark/.local/lib/python3.7/site-packages/botocore/client.py", line 566, in __getattr__
    self.__class__.__name__, item)
AttributeError: 'Athena' object has no attribute 'list_table_metadata'
Ok, as expected with an older version of boto3. Let's try and import the latest version...
I perform the following steps:
1. Go to https://pypi.org/project/boto3/#files
2. Download the boto3-1.17.13-py2.py3-none-any.whl file
3. Place it in an S3 location
4. Go back to the Glue job and, under the Security configuration, script libraries, and job parameters (optional) section, update the Python library path with the S3 location from step 3
5. Rerun the block of code from above
which returns
1.17.9
Traceback (most recent call last):
  File "/tmp/another_sample", line 20, in <module>
    print(athena.list_table_metadata)
  File "/home/spark/.local/lib/python3.7/site-packages/botocore/client.py", line 566, in __getattr__
    self.__class__.__name__, item)
AttributeError: 'Athena' object has no attribute 'list_table_metadata'
If I run this same script locally, which runs 1.17.9, I can find the method:
1.17.9
<bound method ClientCreator._create_api_method.<locals>._api_call of <botocore.client.Athena object at 0x7efd8a4f4710>>
Any ideas on what's going on here and how to access the methods that I would expect should be imported in the upgraded version?
I ended up finding a workaround in the AWS documentation.
I added the following key/value pair to the Glue job parameters under the Security configuration, script libraries, and job parameters (optional) section of the job:
Key:
--additional-python-modules
Value:
botocore>=1.20.12,boto3>=1.17.12
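As a sketch of the same idea applied per run rather than in the job definition (assuming the run is triggered with boto3; the job name here is hypothetical), the parameter can also be passed as a run argument:

import boto3

glue = boto3.client('glue')

# --additional-python-modules takes a comma-separated list of pip-style
# requirement specifiers, the same value as in the console example above.
response = glue.start_job_run(
    JobName='my-glue-job',  # hypothetical job name
    Arguments={'--additional-python-modules': 'botocore>=1.20.12,boto3>=1.17.12'},
)
print(response['JobRunId'])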

Running Jupyter Notebook on AWS Lambda

I am trying to run a Jupyter notebook on AWS Lambda. I created a layer with all the dependencies; the notebook is simple code that pulls a CSV file from Amazon S3 and displays the data as a bar graph. Below is the Lambda function written to download the .ipynb file and execute the notebook with papermill. I'm not sure why it fails with a boto3 module-not-found error.
import json
import sys
import os
import boto3
# papermill to execute notebook
import papermill as pm
import pandas as pd
import logging
import matplotlib.pyplot as plt
sys.path.append("/opt/bin")
sys.path.append("/opt/python")
os.environ["PYTHONPATH"]='/var/task'
os.environ["PYTHONPATH"]='/opt/python/'
os.environ["MPLCONFIGDIR"] = '/tmp/'
# ipython needs a writeable directory
os.environ["IPYTHONDIR"]='/tmp/ipythondir'
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    s3.meta.client.download_file('test-boto', 'testing.ipynb', '/tmp/test.ipynb')
    pm.execute_notebook('/tmp/test.ipynb', '/tmp/juptest_output.ipynb', kernel_name='python3')
    s3_client.upload_file('/tmp/juptest_output.ipynb', 'test-boto', 'temp/juptest_output.ipynb')
    logger.info(event)
Error o/p:
START RequestId: c4da3406-c829-4f99-9fbf-b231a0d3dc06 Version: $LATEST
[INFO] 2020-08-07T17:55:16.602Z c4da3406-c829-4f99-9fbf-b231a0d3dc06 Input Notebook: /tmp/test.ipynb
[INFO] 2020-08-07T17:55:16.603Z c4da3406-c829-4f99-9fbf-b231a0d3dc06 Output Notebook: /tmp/juptest_output.ipynb
Executing: 0%| | 0/15 [00:00<?, ?cell/s][INFO] 2020-08-07T17:55:17.311Z c4da3406-c829-4f99-9fbf-b231a0d3dc06 Executing notebook with kernel: python3
OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k
Executing: 7%|▋ | 1/15 [00:01<00:14, 1.06s/cell]
Executing: 7%|▋ | 1/15 [00:01<00:20, 1.46s/cell]
[ERROR] PapermillExecutionError:
---------------------------------------------------------------------------
Exception encountered at "In [1]":
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-1-9c332490c231> in <module>
1 import pandas as pd
2 import os
----> 3 import boto3
4 import matplotlib.pyplot as plt
5 client = boto3.client('s3')
ModuleNotFoundError: No module named 'boto3'
Traceback (most recent call last):
File "/var/task/lambda_function.py", line 28, in lambda_handler
pm.execute_notebook('/tmp/test.ipynb', '/tmp/juptest_output.ipynb', kernel_name='python3')
File "/opt/python/papermill/execute.py", line 110, in execute_notebook
raise_for_execution_errors(nb, output_path)
File "/opt/python/papermill/execute.py", line 222, in raise_for_execution_errors
    raise error
END RequestId: c4da3406-c829-4f99-9fbf-b231a0d3dc06
REPORT RequestId:c4da3406-c829-4f99-9fbf-b231a0d3dc06
Duration: 1624.78 ms Billed Duration: 1700 ms Memory Size: 3008 MB Max Memory Used: 293 MB
Jupyter Notebook:
import pandas as pd
import os
import boto3
import matplotlib.pyplot as plt
client = boto3.client('s3')
path = 's3://test-boto/aws-costs-Owner-Month-08.csv'
monthly_owner = pd.read_csv(path)
plt.bar(monthly_owner.Owner.head(6),monthly_owner.Amount.head(6))
plt.xlabel('Owner', fontsize=15)
plt.ylabel('Amount', fontsize=15)
plt.title('AWS Monthly Cost by Owner')
plt.show()
It looks like the papermill kernel is not able to find the boto3 package even though your Lambda handler can. I see you are overriding (not appending to) PYTHONPATH in your Lambda handler. This removes the other directories PYTHONPATH would otherwise use to look up packages, and the papermill child process subsequently inherits that reduced path.
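A minimal sketch of appending instead of overriding (using the same directories the handler already references; this is my illustration, not code from the question):

import os

# Keep whatever PYTHONPATH already holds and add the handler and layer
# directories, so the kernel spawned by papermill still finds boto3.
extra_paths = ['/var/task', '/opt/python/']
existing = os.environ.get('PYTHONPATH', '')
os.environ['PYTHONPATH'] = os.pathsep.join(p for p in [existing, *extra_paths] if p)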
You might also find this useful. It allows you to directly deploy Jupyter notebooks as serverless functions. It uses papermill behind the scenes.
Disclaimer: I work for Clouderizer.

Unable to create spark session

When I am creating a Spark session, it throws the error below:
Unable to create a spark session
Using pyspark, here is the traceback:
ValueError Traceback (most recent call last)
<ipython-input-13-2262882856df> in <module>()
37 if __name__ == "__main__":
38 conf = SparkConf()
---> 39 sc = SparkContext(conf=conf)
40 # print(sc.version)
41 # sc = SparkContext(conf=conf)
~/anaconda3/lib/python3.5/site-packages/pyspark/context.py in __init__(self, master, appName, sparkHome, pyFiles, environment, batchSize, serializer, conf, gateway, jsc, profiler_cls)
131 " note this option will be removed in Spark 3.0")
132
--> 133 SparkContext._ensure_initialized(self, gateway=gateway, conf=conf)
134 try:
135 self._do_init(master, appName, sparkHome, pyFiles, environment, batchSize, serializer,
~/anaconda3/lib/python3.5/site-packages/pyspark/context.py in _ensure_initialized(cls, instance, gateway, conf)
330 " created by %s at %s:%s "
331 % (currentAppName, currentMaster,
--> 332 callsite.function, callsite.file, callsite.linenum))
333 else:
334 SparkContext._active_spark_context = instance
ValueError: Cannot run multiple SparkContexts at once; existing SparkContext(app=pyspark-shell, master=local[*]) created by __init__ at <ipython-input-7-edf43bdce70a>:33
imports
from pyspark import SparkConf, SparkContext
I tried this alternative approach, which fails too:
spark = SparkSession(sc).builder.appName("Detecting-Malicious-URL App").getOrCreate()
This is throwing another error as follows:
NameError: name 'SparkSession' is not defined
Try this -
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Detecting-Malicious-URL App").getOrCreate()
Before Spark 2.0 we had to create a SparkConf and SparkContext to interact with Spark.
In Spark 2.0, SparkSession is the entry point to Spark SQL. Now we don't need to create a SparkConf, SparkContext or SQLContext, as they're encapsulated within the SparkSession.
Please refer to this blog for more details: How to use SparkSession in Apache Spark 2.0
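As a small illustration of that encapsulation (a sketch assuming a Spark 2.x pyspark install), the older entry points are reachable from the session itself:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Detecting-Malicious-URL App").getOrCreate()

# The SparkContext lives inside the session, so there is no need to build
# a SparkConf/SparkContext by hand.
sc = spark.sparkContext
print(sc.appName)      # "Detecting-Malicious-URL App"
print(spark.version)   # the Spark version string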
The Spark context is used to connect to the cluster through a resource manager.
A SparkConf is required to create the Spark context object; it stores configuration parameters such as appName (to identify your Spark driver), and the number of cores and memory size of the executors running on worker nodes. To use the SQL, Hive, or Streaming APIs, separate contexts had to be created.
SparkSession, on the other hand, provides a single point of entry to interact with the underlying Spark functionality and allows programming Spark with the DataFrame and Dataset APIs. You don't need to create a separate session to use SQL, Hive, etc.
To create a SparkSession you might use the following builder
SparkSession.builder.master("local").appName("Detecting-Malicious-URL
App") .config("spark.some.config.option", "some-value")
To overcome the error
"NameError: name 'SparkSession' is not defined"
you need to import the class first:
"from pyspark.sql import SparkSession"
pyspark.sql provides SparkSession, which is used to create DataFrames, register DataFrames as tables, and so on.
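For instance (a sketch of my own, with a hypothetical temporary view name), DataFrame and SQL operations go through the one session without a separate SQLContext:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-example").getOrCreate()

# Create a DataFrame and query it with SQL through the same session.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.createOrReplaceTempView("letters")   # hypothetical view name
spark.sql("SELECT COUNT(*) AS n FROM letters").show()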
As for the other error you mentioned
(ValueError: Cannot run multiple SparkContexts at once; existing
SparkContext(app=pyspark-shell, master=local[*]) created by __init__
at <ipython-input-7-edf43bdce70a>:33)
this might be helpful: ValueError: Cannot run multiple SparkContexts at once in spark with pyspark
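One common way around that ValueError in a notebook (a sketch, not part of the linked answer) is to reuse the context that is already running instead of constructing a new one:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("Detecting-Malicious-URL App")

# getOrCreate returns the existing SparkContext if one is active, so
# re-running the cell does not attempt to start a second context.
sc = SparkContext.getOrCreate(conf=conf)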
I faced a similar problem on Windows 10 and solved it the following way:
Go to the Conda prompt and run the following command:
# Install OpenJDK 11
conda install openjdk
Then run:
SparkSession.builder.appName('...').getOrCreate()

Jupyter Cassandra Save Problem - java.lang.NoClassDefFoundError: com/twitter/jsr166e/LongAdder

I am using a Jupyter notebook and want to save a CSV file to a Cassandra database. There is no problem reading the data and showing it, but when I try to save this CSV data to Cassandra it throws the exception below.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost, executor driver): java.lang.NoClassDefFoundError: com/twitter/jsr166e/LongAdder
I downloaded the Maven package manually, both 2.4.0 and 2.4.1, and neither of them worked. I also declared the package at the top of the code.
import sys
import uuid
import time
import os

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.0 pyspark-shell'

try:
    from pyspark import SparkContext
    from pyspark import SparkConf
    from pyspark.sql import SparkSession
    from itertools import islice
    from pyspark.sql import SQLContext
    from pyspark.sql.types import *
    from pyspark.sql import Row
    from datetime import datetime
except ImportError as e:
    print("error importing spark modules", e)
    sys.exit(1)

conf = SparkConf().setAppName("Stand Alone Python Script").setMaster("local[*]")\
    .setAll([('spark.executor.memory', '8g'),
             ('spark.executor.cores', '3'),
             ('spark.cores.max', '3'),
             ('spark.cassandra.connection.host', 'cassandra_ip'),
             ('spark.cassandra.auth.username', 'cassandra_user_name'),
             ('spark.cassandra.auth.password', 'cassandra_password'),
             ('spark.driver.memory', '8g')])

sc = SparkContext(conf=conf)
sql_context = SQLContext(sc)

consumer_complaints = sql_context.read.format("csv").option("header", "true").option("inferSchema", "false").load("in/Consumer_Complaints.csv")

consumer_complaints.write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table="table_name", keyspace="space_name")\
    .save()

sc.stop()
Hello, I solved my problem with the following steps:
I downloaded the Twitter jsr166e JAR and moved it to the $SPARK_HOME/jars directory.
cp /home/jovyan/.m2/repository/com/twitter/jsr166e/1.1.0/jsr166e-1.1.0.jar /usr/local/spark/jars/
Also, because the Docker image's Jupyter user is jovyan, not root, I granted permissions on this folder.
I used the statement below directly, but you can use a more restrictive approach.
chmod -R 777 /usr/local/spark/jars/
Thank you
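An alternative sketch (my assumption, not part of the steps above) is to let Spark resolve the missing class itself by adding the jsr166e coordinates, visible in the JAR path above, to the packages list:

import os

# Pull the Cassandra connector together with the jsr166e dependency from
# Maven instead of copying JARs into $SPARK_HOME/jars by hand.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages com.datastax.spark:spark-cassandra-connector_2.11:2.4.0,'
    'com.twitter:jsr166e:1.1.0 pyspark-shell'
)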

python code not running on Jupyter notebook

I have already-working Python code for a document similarity server.
The code runs fine from the command line; however, when I try to run it from a Jupyter notebook I get the following error (you can find the code below).
AttributeError Traceback (most recent call last)
in ()
----> 1 simServer.queryIndex('National Intergroup Inc said it plans to file a registration statement')
<ipython-input-2-81df834abc60> in queryIndex(self, queryText)
58 print "Querying the INDEX"
59 doc = {'tokens': utils.simple_preprocess(queryText)}
---> 60 print(self.service.find_similar(doc, min_score=0.4, max_results=50))
At first I got a different error message, for which the solution was to install the simserver library within the Jupyter notebook using the command !pip install --upgrade simserver, but now I do not think there is a missing library that needs to be installed.
Relevant code from jupyter notebook:
Line where the issue occurs
simServer.queryIndex('National Intergroup Inc said it plans to file a registration statement')
#!/usr/bin/env python
import pickle
import os
import re
import glob
import pprint
import json
from gensim import utils
from simserver import SessionServer
import logging

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

class SimilarityServer(object):
    def __init__(self):
        print "Openning sesesion and setting it to true"
        self.service = SessionServer('tmp/my_server/')
        self.service.open_session()
        self.service.set_autosession(True)

    def indexDocs(self):
        print "Docs indexing and training"
        # train and index
        print "Training"
        self.service.session.train(None, method='lsi', clear_buffer=False)
        print "Indexing"
        self.service.session.index(None)

    def queryIndex(self, queryText):
        print "Querying the INDEX"
        doc = {'tokens': utils.simple_preprocess(queryText)}
        print(self.service.find_similar(doc, min_score=0.4, max_results=50))

simServer = SimilarityServer()
simServer.queryIndex('National Intergroup Inc said it plans to file a registration statement')
