Not able to connect to a Kafka topic using Spark Streaming (Python, Jupyter)

I tried to connect to a Kafka topic using Spark Streaming. It doesn't read any data into its DStream, and it doesn't raise any error either.
Here is my jupyter code:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell'
from pprint import pprint
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
sc = SparkContext(appName="PythonSparkStreamingKafka_RM_01")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
# createStream(ssc, zookeeper quorum, consumer group id, {topic: number of partitions to consume})
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 'spark-streaming', {'topic_name': 1})
kafkaStream.pprint()
Nothing gets printed. I also tried createDirectStream but didn't get any output. I followed "Spark Streaming not reading from Kafka topics" and added PYTHONPATH, but that didn't help either.
Any help would be deeply appreciated. Thanks!

It's not clear whether you are actually sending any data to the topic, but, more importantly, you never start the consumption.
You'll need this at the end:
ssc.start()
ssc.awaitTermination()
You also need to add "auto.offset.reset": "smallest" to the Kafka parameters if you want to read data that already exists in the topic:
from pyspark.streaming.kafka import KafkaUtils
directKafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers, "auto.offset.reset": "smallest"})
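Putting both points together, a minimal sketch of the receiver-based consumer (Zookeeper on localhost:2181 and the topic name 'topic_name' are assumptions taken from the question; replace them with your own values):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="PythonSparkStreamingKafka_RM_01")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 10)  # shorter batch interval so output appears quickly

# read from the beginning of the topic so existing messages are also printed
kafkaStream = KafkaUtils.createStream(
    ssc, 'localhost:2181', 'spark-streaming', {'topic_name': 1},
    kafkaParams={"auto.offset.reset": "smallest"})
kafkaStream.pprint()

ssc.start()             # nothing is consumed until the context is started
ssc.awaitTermination()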

As cricket_007 mentioned, Structured Streaming is generally preferred. If you still want to handle it with the createDirectStream method, a sample is shown below.
Note: this reads messages from the topic 'topicname' and writes them into another topic called 'compacttopic'.
from pyspark import SparkConf, SparkContext
from operator import add
import sys
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
from kafka import SimpleProducer, KafkaClient
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')

def handler(message):
    records = message.collect()
    for record in records:
        value_all = record[1]            # each record is a (key, value) tuple
        value_spt = value_all.split('|')
        value_key = value_spt[0]         # first pipe-delimited field becomes the new key
        print(value_key)
        producer.send('compacttopic', key=str(value_key), value=str(record[1]))
        producer.flush()

def main():
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 10)
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, ['topicname'], {"metadata.broker.list": 'localhost:9092'})
    kvs.foreachRDD(handler)
    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()
spark-submit command:
./bin/spark-submit --jars /Users/KarthikeyanDurairaj/jarfiles/spark-streaming-kafka-0-8-assembly_2.11-2.3.1.jar topictotopic.py localhost:9092 topicname
Note: adjust the jar version to match your installed Spark version.
Structured Streaming approach:
You can refer to the Stack Overflow link below for PySpark-based Structured Streaming.
Failed to find leader for topics; java.lang.NullPointerException NullPointerException at org.apache.kafka.common.utils.Utils.formatAddress
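For reference, a minimal Structured Streaming sketch (assuming Spark 2.x with the org.apache.spark:spark-sql-kafka-0-10 package on the classpath, a broker at localhost:9092, and the same 'topicname' / 'compacttopic' pair as above; the checkpoint path is an assumption):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaTopicToTopic").getOrCreate()

# Read the source topic from the earliest available offsets
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topicname")
      .option("startingOffsets", "earliest")
      .load())

# The Kafka sink expects string or binary 'key' and 'value' columns
out = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

query = (out.writeStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("topic", "compacttopic")
         .option("checkpointLocation", "/tmp/kafka-checkpoint")  # required for the Kafka sink
         .start())

query.awaitTermination()
Launch it with --packages org.apache.spark:spark-sql-kafka-0-10_2.11:<your Spark version>.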

Related

Getting objects from S3 bucket using PySpark

I'm trying to get JSON objects from an S3 bucket using PySpark (on Windows, via a WSL2 terminal).
I can do this using boto3 as an intermediate step, but when I try to use the spark.read.json method I get an error.
Code:
import findspark
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
import os
import multiprocessing
#----------------APACHE CONFIGURATIONS--------------
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
#---------------spark--------------
conf = (
    SparkConf()
    .set('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
    .set('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
    .setAppName('pyspark_aws')
    .setMaster(f"local[{multiprocessing.cpu_count()}]")
    .setIfMissing("spark.executor.memory", "2g")
)
sc=SparkContext(conf=conf)
sc.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')
spark=SparkSession(sc)
#--------------hadoop--------------
accessKeyId='xxxxxxxxxxxx'
secretAccessKey='xxxxxxxxx'
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', accessKeyId)
hadoopConf.set('fs.s3a.secret.key', secretAccessKey)
hadoopConf.set('fs.s3a.endpoint', 's3-eu-west-1.amazonaws.com')
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
hadoopConf.set('fs.s3a.multipart.size', '419430400')
hadoopConf.set('fs.s3a.multipart.threshold', '2097152000')
hadoopConf.set('fs.s3a.connection.maximum', '500')
hadoopConf.set('s3a.connection.timeout', '600000')
s3_df = spark.read.json('s3a://{bucket}/{directory}/{object}.json')
Error:
py4j.protocol.Py4JJavaError: An error occurred while calling ...
: java.lang.NumberFormatException: For input string: "32M"
    at java.base/java.lang.NumberFormatException.forInputString(...)
    at java.base/java.lang.Long.parseLong(...)
    at org.apache.hadoop.conf.Configuration.getLong(...)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getDefaultBlockSize(...)
    at org.apache.hadoop.fs.FileSystem.getDefaultBlockSize(...)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(...)
    at org.apache.hadoop.fs.FileSystem.exists(...)
    at org.apache.spark.sql.execution.datasources.DataSource...
    at org.apache.spark.util.ThreadUtils$...
    (remaining java.util.concurrent.ForkJoin* frames truncated)
I added the multipart.size, multipart.threshold, connection.maximum, and connection.timeout Hadoop conf settings when I was getting a similar error earlier (that earlier error said '64M' instead of '32M' and changed once I added these settings).
I'm new to Spark so any and all tips/pointers would be helpful!
If needed: the "32M" is the default value of "fs.s3a.block.size", and, as the stack trace shows, it is read through Configuration.getLong, which cannot parse the "M" suffix in this Hadoop version.
Try hadoopConf.set('fs.s3a.block.size', '33554432')
See https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html for the explanations of the "32M" and "64M" defaults.
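A minimal sketch of that idea, expressing every size-like s3a option as a plain byte count (the block size value comes from the answer above; the other values are the ones already used in the question):
# Plain byte counts so Configuration.getLong can parse them on Hadoop 2.7.x
hadoopConf.set('fs.s3a.block.size', '33554432')              # 32 MB
hadoopConf.set('fs.s3a.multipart.size', '419430400')         # 400 MB
hadoopConf.set('fs.s3a.multipart.threshold', '2097152000')   # ~2 GB

s3_df = spark.read.json('s3a://{bucket}/{directory}/{object}.json')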

Exception: Java gateway process exited before sending its port number with pyspark

I am working with Python and PySpark in a Jupyter notebook. I am trying to read several Parquet files from an AWS S3 bucket and convert them into a single JSON file.
This is what I have:
from functools import reduce
from pyspark.sql import DataFrame

# 's3' is assumed to be an existing boto3 resource, e.g. boto3.resource('s3')
bucket = s3.Bucket(name='mybucket')
keys = []
for key in bucket.objects.all():
    keys.append(key.key)
print(keys[0])

from pyspark.sql import SparkSession
# initialise sparkContext
spark = SparkSession.builder \
    .master('local') \
    .appName('myAppName') \
    .config('spark.executor.memory', '5gb') \
    .config("spark.cores.max", "6") \
    .getOrCreate()
sc = spark.sparkContext
But I am getting:
Exception: Java gateway process exited before sending its port number with pyspark
I am not sure how to fix this, thank you!
You're getting this error because PySpark is not able to launch or communicate with the Java gateway for your cluster. You need to set a few environment variables, like this:
import os
import findspark
findspark.init()
os.environ['PYSPARK_SUBMIT_ARGS'] = """--name job_name --master local[*]
    --conf spark.dynamicAllocation.enabled=true
    pyspark-shell"""  # use --master yarn instead of local[*] on a cluster
os.environ['PYSPARK_PYTHON'] = "python3.6"  # whatever version of Python you're using
os.environ['python'] = "python3.6"
The findspark package is optional, but it is handy for locating your Spark installation when using PySpark.
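Putting that together with the notebook code from the question, a minimal sketch (local[*] as the master and python3.6 as the interpreter are assumptions; adjust both to your environment):
import os
import findspark

findspark.init()  # locates SPARK_HOME so the Java gateway can be launched

os.environ['PYSPARK_SUBMIT_ARGS'] = '--name myAppName --master local[*] pyspark-shell'
os.environ['PYSPARK_PYTHON'] = 'python3.6'

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('local[*]')
         .appName('myAppName')
         .config('spark.executor.memory', '5g')
         .config('spark.cores.max', '6')
         .getOrCreate())
sc = spark.sparkContext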

Passing AWS Credentials in Python Script

I have a Python script that gets called by a PHP script. The user that invokes the PHP script is apache, so the Python file is also invoked as apache, and it fails with "Unable to locate credentials". I've set the default credentials via the AWS CLI, and when I invoke the Python script as root it works.
This is my line of code:
client = boto3.client('ses', region_name=awsregion, aws_access_key_id='AJHHJHJHJ', aws_secret_access_key='asdasd/asdasd/asd')
But this gives an "Invalid Syntax" error. So I tried this:
client = boto3.Session(aws_access_key_id='ASDASD', aws_secret_access_key='asd/asdasd/asdasd')
client = boto3.client('ses', region_name=awsregion, aws_access_key_id='ASDASD', aws_secret_access_key='asd/asdasd/asdasd')
This gives the same error as above. The weird thing is that this exact usage is shown in the documentation; even though it's not recommended, it should work.
Can somebody help me in fixing this?
Did you ever get this resolved? Here is how I connect to boto3 in my Python scripts:
import boto3
from botocore.exceptions import ClientError
import re
from io import BytesIO
import gzip
import datetime
import dateutil.parser as dparser
from datetime import datetime
import tarfile
import requests
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## Needed glue stuff
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
##
## currently this will run for everything that is in the staging directory of omniture
# set needed parms
myProfileName = 'MyDataLake'
dhiBucket = 'data-lake'
#create boto3 session
try:
    session = boto3.Session(aws_access_key_id='aaaaaaaaaaaa',
                            aws_secret_access_key='abcdefghijklmnopqrstuvwxyz',
                            region_name='us-east-1')
    s3 = session.resource('s3')  # establish connection to S3
except Exception as conne:
    print("Unable to connect: " + str(conne))
    errtxt = requests.post("https://errorcapturesite",
                           data={'message': 'Unable to connect to : ' + myProfileName, 'notify': True, 'color': 'red'})
    print(errtxt.text)
    exit()
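For the original SES use case, a minimal sketch assuming the same explicit-credential session approach carries over (the key values and region here are placeholders, not real credentials):
import boto3

# Build the session once with explicit credentials, then create the SES client from it
session = boto3.Session(aws_access_key_id='AKIA_PLACEHOLDER',
                        aws_secret_access_key='SECRET_PLACEHOLDER',
                        region_name='us-east-1')
ses = session.client('ses')
print(ses.get_send_quota())  # simple call to confirm the credentials are picked up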

How can I use the requests module in Spark?

This is the code I used:
from __future__ import print_function
import sys
from pyspark.sql import SparkSession
sys.path.append('/usr/local/lib/python2.7/site-packages')
import requests

if __name__ == "__main__":
    s = requests.Session()
    toGet = s.get
    spark = SparkSession\
        .builder\
        .appName("PythonDockerRepoStat")\
        .getOrCreate()
    lines = spark.read.text('/data/urls.txt').rdd.map(lambda r: r[0])
    res = lines.flatMap(lambda x: x.split("\n"))\
        .map(lambda x: toGet(x))
    output = res.collect()
    print(output)
However, I got this error: ImportError: No module named requests.sessions
When launching Spark jobs, all dependencies have to be accessible to both:
- the driver interpreter,
- the executor interpreters.
Extending the path with
sys.path.append('/usr/local/lib/python2.7/site-packages')
affects only the local driver interpreter. To set executor environment variables you can (see the sketch below):
- modify $SPARK_HOME/conf/spark-env.sh, or
- use the spark.executorEnv.[EnvironmentVariableName] configuration option (for example by editing $SPARK_HOME/conf/spark-defaults.conf or setting the corresponding SparkConf key).
At the same time you should make sure that requests is installed / accessible on every worker node (unless you're using local / pseudo-distributed mode).
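A minimal sketch of the spark.executorEnv option, assuming the packages live at the same path on every worker node (the path is taken from the question and may differ on your machines):
from pyspark.sql import SparkSession

# Point the executor Python processes at the same site-packages directory as the driver
spark = (SparkSession.builder
         .appName("PythonDockerRepoStat")
         .config("spark.executorEnv.PYTHONPATH", "/usr/local/lib/python2.7/site-packages")
         .getOrCreate())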

load external libraries inside pyspark code

I have a Spark cluster that I use in local mode. I want to read a CSV with the Databricks external library spark-csv. I start my app as follows:
import os
import sys
os.environ["SPARK_HOME"] = "/home/mebuddy/Programs/spark-1.6.0-bin-hadoop2.6"
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, spark_home + "/python")
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
from pyspark import SparkContext, SparkConf, SQLContext
try:
    sc
except NameError:
    print('initializing SparkContext...')
    sc = SparkContext()
sq = SQLContext(sc)
df = sq.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load("/my/path/to/my/file.csv")
When I run it, I get the following error:
java.lang.ClassNotFoundException: Failed to load class for data source: com.databricks.spark.csv.
My question: how can I load the databricks spark-csv library INSIDE my Python code? I don't want to load it from outside (using --packages), for instance.
I tried to add the following lines but it did not work:
os.environ["SPARK_CLASSPATH"] = '/home/mebuddy/Programs/spark_lib/spark-csv_2.11-1.3.0.jar'
If you create the SparkContext from scratch you can, for example, set PYSPARK_SUBMIT_ARGS before the SparkContext is initialized:
os.environ["PYSPARK_SUBMIT_ARGS"] = (
"--packages com.databricks:spark-csv_2.11:1.3.0 pyspark-shell"
)
sc = SparkContext()
If, for some reason, you expect that the SparkContext has already been initialized, as your code suggests, this won't work. In local mode you could try to use the Py4J gateway and a URLClassLoader, but it doesn't look like a good idea and won't work in cluster mode.
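Putting the first suggestion together, a minimal sketch (the spark-csv version and the file path come from the question; adjust them to your setup):
import os

# Must be set before the SparkContext (and hence the JVM) is created
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages com.databricks:spark-csv_2.11:1.3.0 pyspark-shell"
)

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sq = SQLContext(sc)
df = (sq.read.format('com.databricks.spark.csv')
      .options(header='true', inferschema='true')
      .load("/my/path/to/my/file.csv"))
df.show(5)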
