Kafka consumer: AttributeError: 'list' object has no attribute 'map' - python

I want to read messages from a Kafka queue in Python. For example, in Scala it's quite easy to do:
val ssc = new StreamingContext(conf, Seconds(20))
// Divide the topic into partitions
val topicMessages = "myKafkaTopic"
val topicMessagesMap = topicMessages.split(",").map((_, kafkaNumThreads)).toMap
val messages = KafkaUtils.createStream(ssc, zkQuorum, group, topicMessagesMap).map(_._2)
messages.foreachRDD { rdd =>
//...
}
I want to do just the same in Python. This is my current Python code:
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
ssc = StreamingContext(sc, 20)
topicMessages = "myKafkaTopic"
topicMessagesMap = topicMessages.split(",").map((_, kafkaNumThreads)).toMap
messages = KafkaUtils.createStream(ssc, zkQuorum, group, topicMessagesMap)
However, I get an error at the line topicMessagesMap = topicMessages.split(",").map((_, kafkaNumThreads)).toMap:
AttributeError: 'list' object has no attribute 'map'
How can I make this code work?
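For what it's worth, the immediate AttributeError comes from calling .map on a Python list, which has no such method; a dict comprehension produces the same {topic: threads} mapping that the Scala .map(...).toMap builds. A minimal sketch (kafkaNumThreads is assumed to be defined, e.g. 1):
kafkaNumThreads = 1  # assumption: one consumer thread per topic
topicMessages = "myKafkaTopic"
topicMessagesMap = {topic: kafkaNumThreads for topic in topicMessages.split(",")}
# -> {'myKafkaTopic': 1}, the same shape Scala's .map((_, kafkaNumThreads)).toMap produces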
UPDATE:
If I run this code in Jupyter Notebook, then I get the error shown below:
messages = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {inputKafkaTopic: list})
Spark Streaming's Kafka libraries not found in class path. Try one of the following.

1. Include the Kafka library and its dependencies with in the
   spark-submit command as

   $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.0.0 ...

2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
   Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.0.0.
   Then, include the jar in the spark-submit command as

   $ bin/spark-submit --jars ...
Do I understand correctly that the only way to make this work is to use spark-submit, and that it's impossible to run this code from Jupyter/IPython?
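For reference, a hedged sketch of running the same thing from Jupyter without spark-submit: export the Kafka package through PYSPARK_SUBMIT_ARGS before the SparkContext is created (the same trick used in one of the questions further down). The package coordinates and the ZooKeeper address below are assumptions and must match your setup:
import os
# Must be set before the SparkContext exists; the package version is an assumption
# and has to match your Spark/Scala build.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell'
)

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="KafkaFromJupyter")
ssc = StreamingContext(sc, 20)
zkQuorum = "localhost:2181"   # assumed ZooKeeper quorum
messages = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer",
                                   {"myKafkaTopic": 1}).map(lambda x: x[1])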

Related

Getting objects from S3 bucket using PySpark

I'm trying to get JSON objects from an S3 bucket using PySpark (on Windows, using a WSL2 terminal).
I can do this using boto3 as an intermediate step, but when I try to use the spark.read.json method, I get an error.
Code:
import findspark
from pyspark.sql import SparkSession
from pyspark import SparkContext, SparkConf
import os
import multiprocessing
#----------------APACHE CONFIGURATIONS--------------
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell'
#---------------spark--------------
conf = (
    SparkConf()
    .set('spark.executor.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
    .set('spark.driver.extraJavaOptions', '-Dcom.amazonaws.services.s3.enableV4=true')
    .setAppName('pyspark_aws')
    .setMaster(f"local[{multiprocessing.cpu_count()}]")
    .setIfMissing("spark.executor.memory", "2g")
)
sc=SparkContext(conf=conf)
sc.setSystemProperty('com.amazonaws.services.s3.enableV4', 'true')
spark=SparkSession(sc)
#--------------hadoop--------------
accessKeyId='xxxxxxxxxxxx'
secretAccessKey='xxxxxxxxx'
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.access.key', accessKeyId)
hadoopConf.set('fs.s3a.secret.key', secretAccessKey)
hadoopConf.set('fs.s3a.endpoint', 's3-eu-west-1.amazonaws.com')
hadoopConf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
hadoopConf.set('fs.s3a.multipart.size', '419430400')
hadoopConf.set('fs.s3a.multipart.threshold', '2097152000')
hadoopConf.set('fs.s3a.connection.maximum', '500')
hadoopConf.set('fs.s3a.connection.timeout', '600000')  # note the fs. prefix; plain 's3a.connection.timeout' is ignored
s3_df = spark.read.json('s3a://{bucket}/{directory}/{object}.json')
Error:
py4j.protocol.Py4JJavaError: An error occurred while calling ...
: java.lang.NumberFormatException: For input string: "32M"
    at java.base/java.lang.NumberFormatException.forInputString(...)
    at java.base/java.lang.Long.parseLong(...)
    at java.base/java.lang.Long.parseLong(...)
    at org.apache.hadoop.conf.Configuration.getLong(...)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getDefaultBlockSize(...)
    at org.apache.hadoop.fs.FileSystem.getDefaultBlockSize(...)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(...)
    at org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(...)
    at org.apache.hadoop.fs.FileSystem.exists(...)
    at org.apache.spark.sql.execution.datasources.DataSource...
    at org.apache.spark.sql.execution.datasources.DataSource...
    at org.apache.spark.util.ThreadUtils$.$anonfun$parmap...
    at java.base/java.util.concurrent.ForkJoinTask... (remaining ForkJoin frames truncated)
I added the multipart.size, multipart.threshold, connection.maximum, and connection.timeout Hadoop conf settings when I was getting a similar error earlier (that earlier error showed '64M' instead of '32M', and it changed when I added these settings).
I'm new to Spark so any and all tips/pointers would be helpful!
If needed: the "32M" is the default value of "fs.s3a.block.size". Try
hadoopConf.set('fs.s3a.block.size', '33554432')
See https://hadoop.apache.org/docs/current/hadoop-aws/tools/hadoop-aws/index.html for the explanation of the "32M" and "64M" defaults.
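A sketch of where that fix goes in the question's code; 33554432 is just 32 MiB written out in bytes, which the Hadoop 2.7.x parser bundled with hadoop-aws 2.7.3 can read, and the rest of the setup is assumed to be as in the question:
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set('fs.s3a.block.size', '33554432')   # 32 * 1024 * 1024 bytes; avoids the "32M" suffix
s3_df = spark.read.json('s3a://{bucket}/{directory}/{object}.json')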

How to run an AWS Glue Python Spark Job with the Current Version of boto3?

I'm trying to run the latest version of boto3 in an AWS Glue spark job to access methods that aren't available in the default version in Glue.
To get the default version of boto3 and verify that the method I want to access isn't available, I run this block of code, which is all boilerplate except for my print statements:
import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
athena = boto3.client('athena')
print(boto3.__version__) # verify the default version boto3 imports
print(athena.list_table_metadata) # method I want to verify I can access in Glue
job.commit()
which returns
1.12.4
Traceback (most recent call last):
  File "/tmp/another_sample", line 20, in <module>
    print(athena.list_table_metadata)
  File "/home/spark/.local/lib/python3.7/site-packages/botocore/client.py", line 566, in __getattr__
    self.__class__.__name__, item)
AttributeError: 'Athena' object has no attribute 'list_table_metadata'
Ok, as expected with an older version of boto3. Let's try and import the latest version...
I perform the following steps:
Go to https://pypi.org/project/boto3/#files
Download the boto3-1.17.13-py2.py3-none-any.whl file
Place it in S3 location
Go back to the Glue job and, under the Security configuration, script libraries, and job parameters (optional) section, update the Python library path with the S3 location from step 3
Rerun block of code from above
which returns
1.17.9
Traceback (most recent call last):
  File "/tmp/another_sample", line 20, in <module>
    print(athena.list_table_metadata)
  File "/home/spark/.local/lib/python3.7/site-packages/botocore/client.py", line 566, in __getattr__
    self.__class__.__name__, item)
AttributeError: 'Athena' object has no attribute 'list_table_metadata'
If I run this same script locally, which is running 1.17.9, I can find the method:
1.17.9
<bound method ClientCreator._create_api_method.<locals>._api_call of <botocore.client.Athena object at 0x7efd8a4f4710>>
Any ideas on what's going on here and how to access the methods that I would expect should be imported in the upgraded version?
I ended up finding a workaround in the AWS documentation.
I added the following key/value pair in the Glue job parameters, under the Security configuration, script libraries, and job parameters (optional) section of the job:
Key:
--additional-python-modules
Value:
botocore>=1.20.12,boto3>=1.17.12
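With that parameter in place, the Athena client created in the job should expose the newer API. A hedged sketch of calling it; the catalog and database names below are placeholders:
import boto3

athena = boto3.client('athena')
print(boto3.__version__)   # should now report the pinned 1.17.x version

# Placeholders: substitute your own catalog and database names.
response = athena.list_table_metadata(
    CatalogName='AwsDataCatalog',
    DatabaseName='my_database',
)
for table in response['TableMetadataList']:
    print(table['Name'])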

Exception: Java gateway process exited before sending its port number with pyspark

I am working with Python and PySpark in a Jupyter notebook. I am trying to read several Parquet files from an AWS S3 bucket and convert them into a single JSON file.
This is what I have:
from functools import reduce
from pyspark.sql import DataFrame
import boto3  # assumed: the s3 resource used below comes from boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket(name='mybucket')
keys = []
for key in bucket.objects.all():
    keys.append(key.key)
print(keys[0])
from pyspark.sql import SparkSession
# initialise sparkContext
spark = SparkSession.builder \
.master('local') \
.appName('myAppName') \
.config('spark.executor.memory', '5gb') \
.config("spark.cores.max", "6") \
.getOrCreate()
sc = spark.sparkContext
But I am getting:
Exception: Java gateway process exited before sending its port number with pyspark
I am not sure how to fix this, thank you!
You're getting this error because your PySpark is not able to communicate with your cluster. You need to set a few environment variables, like this:
import os
import findspark

# These must be set before any SparkContext/SparkSession is created.
os.environ['PYSPARK_SUBMIT_ARGS'] = ("--name job_name --master local[*] "  # or --master yarn
                                     "--conf spark.dynamicAllocation.enabled=true "
                                     "pyspark-shell")
os.environ['PYSPARK_PYTHON'] = "python3.6"         # whatever version of Python you are using
os.environ['PYSPARK_DRIVER_PYTHON'] = "python3.6"
findspark.init()
The findspark package is optional, but it's good to use when working with PySpark from a notebook.
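As a quick sanity check (a sketch reusing the question's app name; the master and memory settings are assumptions), the SparkSession should now start without the gateway error:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master('local[*]')
         .appName('myAppName')
         .getOrCreate())
print(spark.version)   # printing the version confirms the Java gateway came up and responded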

Not able to connect to kafka topic using spark streaming (python, jupyter)

I tried to connect to a Kafka topic using Spark. It's not reading any data into its DStream, nor is it giving any error.
Here is my jupyter code:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.2 pyspark-shell'
from pretty import pprint
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
sc = SparkContext(appName="PythonSparkStreamingKafka_RM_01")
sc.setLogLevel("WARN")
ssc = StreamingContext(sc, 60)
kafkaStream = KafkaUtils.createStream(ssc, 'localhost:2181', 'spark-streaming', {'topic_name':1})
kafkaStream.pprint()
Nothing gets printed. I also tried createDirectStream but didn't get any output. I followed Spark Streaming not reading from Kafka topics and added PYTHONPATH, but it didn't help either.
Any help would be deeply appreciated. Thanks!
It's not clear whether you are sending any data, but you're not actually starting consumption.
You'll need this at the end
ssc.start()
ssc.awaitTermination()
You need to add "auto.offset.reset": "smallest" to the createStream properties to read existing topic data.
from pyspark.streaming.kafka import KafkaUtils
directKafkaStream = KafkaUtils.createDirectStream(ssc, [topic], {"bootstrap.servers": brokers, "auto.offset.reset": "smallest"})
As cricket_007 mentioned, Structured Streaming is generally preferred. If you still want to handle it with the direct-stream method, there is a sample below.
Note: this sample reads messages from the topic 'topicname' and rewrites them into another topic called 'compacttopic'.
from pyspark import SparkConf, SparkContext
from operator import add
import sys
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
import json
from kafka import SimpleProducer, KafkaClient
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers='localhost:9092')

def handler(message):
    records = message.collect()
    for record in records:
        value_all = record[1]
        value_spt = value_all.split('|')
        value_key = value_spt[0]
        print(value_key)
        producer.send('compacttopic', key=str(value_key), value=str(record[1]))
        producer.flush()

def main():
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 10)
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, ['topicname'], {"metadata.broker.list": 'localhost:9092'})
    kvs.foreachRDD(handler)
    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()
spark-submit command :
./bin/spark-submit --jars /Users/KarthikeyanDurairaj/jarfiles/spark-streaming-kafka-0-8-assembly_2.11-2.3.1.jar topictotopic.py localhost:9092 topicname
Note: adjust the jar version to match your installed Spark version.
Structured Streaming approach:
You can refer to the Stack Overflow question below for PySpark-based Structured Streaming.
Failed to find leader for topics; java.lang.NullPointerException NullPointerException at org.apache.kafka.common.utils.Utils.formatAddress
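For completeness, a minimal Structured Streaming sketch of the same read. It assumes the spark-sql-kafka-0-10 package matching your Spark/Scala version is on the classpath (e.g. via --packages) and reuses the topic and broker names from this thread:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("KafkaStructuredStreaming").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "topicname")
      .option("startingOffsets", "earliest")   # analogous to auto.offset.reset = smallest
      .load())

# Kafka keys and values arrive as binary; cast them to strings before printing.
query = (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .format("console")
         .start())

query.awaitTermination()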

No output operations registered, so nothing to execute in PySpark

I am trying to integrate Spark with Kafka. The Kafka consumer has JSON data, and I want to integrate the consumer with Spark for processing. When I run the code below, the following error is thrown.
bin\spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 test.py localhost:9092 maktest
My test.py is below
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    print(lines)
    ssc.start()
    ssc.awaitTermination()
I got the error below
18/12/10 16:41:40 INFO VerifiableProperties: Verifying properties
18/12/10 16:41:40 INFO VerifiableProperties: Property group.id is overridden to
18/12/10 16:41:40 INFO VerifiableProperties: Property zookeeper.connect is overridden to
<pyspark.streaming.kafka.KafkaTransformedDStream object at 0x000002A6DA9FE6A0>
18/12/10 16:41:40 ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:224)
Traceback (most recent call last):
File "C:/Users/maws/Desktop/spark-2.2.1-bin-hadoop2.7/test.py", line 12, in <module>
ssc.start()
py4j.protocol.Py4JJavaError: An error occurred while calling o25.start.
: java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
You're not using a supported Spark Streaming DStream output operation.
For the pyspark API, you should use:
pprint()
saveAsTextFiles(prefix, [suffix])
saveAsObjectFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
foreachRDD(func)
print() can't be used as a DStream output operation in PySpark, so when you follow Spark Streaming examples written for Scala or Java, make sure you change print() to pprint().
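Applied to the test.py above, that means registering a real output operation on the DStream instead of calling the built-in print on it. A minimal sketch of the relevant lines:
lines = kvs.map(lambda x: x[1])
lines.pprint()            # DStream output operation; print(lines) only prints the
                          # DStream object and registers nothing to execute
ssc.start()
ssc.awaitTermination()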
