No output operations registered, so nothing to execute in PySpark - python

I am trying to integrate Spark with Kafka. The Kafka consumer has JSON data, and I want to integrate the consumer with Spark for processing. When I run the code below, an error is thrown.
bin\spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 test.py localhost:9092 maktest
My test.py is below
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    print(lines)
    ssc.start()
    ssc.awaitTermination()
I got the error below
18/12/10 16:41:40 INFO VerifiableProperties: Verifying properties
18/12/10 16:41:40 INFO VerifiableProperties: Property group.id is overridden to
18/12/10 16:41:40 INFO VerifiableProperties: Property zookeeper.connect is overridden to
<pyspark.streaming.kafka.KafkaTransformedDStream object at 0x000002A6DA9FE6A0>
18/12/10 16:41:40 ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:224)
Traceback (most recent call last):
File "C:/Users/maws/Desktop/spark-2.2.1-bin-hadoop2.7/test.py", line 12, in <module>
ssc.start()
py4j.protocol.Py4JJavaError: An error occurred while calling o25.start.
: java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute

You're not using a supported Spark Streaming DStream output operation.
For the PySpark API, you should use one of:
pprint()
saveAsTextFiles(prefix, [suffix])
saveAsObjectFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
foreachRDD(func)
print() is not a DStream output operation in PySpark, so when you adapt Spark Streaming examples written for Scala or Java, make sure you change print() to pprint().
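For example, a minimal rework of the test.py from the question that registers pprint() as an output operation (a sketch; the broker list and topic are still taken from the command line as before):

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    # pprint() is a DStream output operation, so the context now has work to execute.
    lines.pprint()
    # Alternatively, register custom per-batch logic:
    # lines.foreachRDD(lambda rdd: print(rdd.take(10)))
    ssc.start()
    ssc.awaitTermination()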

Related

How to run an AWS Glue Python Spark Job with the Current Version of boto3?

I'm trying to run the latest version of boto3 in an AWS Glue spark job to access methods that aren't available in the default version in Glue.
To get the default version of boto3 and verify the method I want to access isn't available I run this block of code which is all boilerplate except for my print statements:
import sys
import boto3
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
## #params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
athena = boto3.client('athena')
print(boto3.__version__) # verify the default version boto3 imports
print(athena.list_table_metadata) # method I want to verify I can access in Glue
job.commit()
which returns
1.12.4
Traceback (most recent call last):
  File "/tmp/another_sample", line 20, in <module>
    print(athena.list_table_metadata)
  File "/home/spark/.local/lib/python3.7/site-packages/botocore/client.py", line 566, in __getattr__
    self.__class__.__name__, item)
AttributeError: 'Athena' object has no attribute 'list_table_metadata'
Ok, as expected with an older version of boto3. Let's try and import the latest version...
I perform the following steps:
Go to https://pypi.org/project/boto3/#files
Download the boto3-1.17.13-py2.py3-none-any.whl file
Place it in S3 location
Go back to the Glue job and, under the Security configuration, script libraries, and job parameters (optional) section, update the Python library path with the S3 location from step 3
Rerun block of code from above
which returns
1.17.9
Traceback (most recent call last):
  File "/tmp/another_sample", line 20, in <module>
    print(athena.list_table_metadata)
  File "/home/spark/.local/lib/python3.7/site-packages/botocore/client.py", line 566, in __getattr__
    self.__class__.__name__, item)
AttributeError: 'Athena' object has no attribute 'list_table_metadata'
If I run this same script locally, which is running 1.17.9, I can find the method:
1.17.9
<bound method ClientCreator._create_api_method.<locals>._api_call of <botocore.client.Athena object at 0x7efd8a4f4710>>
Any ideas on what's going on here and how to access the methods that I would expect should be imported in the upgraded version?
Ended up finding a work-around solution in the AWS documentation.
Added the following Key/Value pair in the Glue Job parameters under the Security configuration, script libraries, and job parameters (optional) section of the job:
Key:
--additional-python-modules
Value:
botocore>=1.20.12,boto3>=1.17.12
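If you prefer to apply the same fix programmatically rather than through the console, a sketch using boto3's Glue get_job/update_job calls could look like the one below. The job name is a placeholder, and since UpdateJob replaces the job definition, the existing definition is fetched first and only the default arguments are changed:

import boto3

glue = boto3.client('glue')

# Hypothetical job name; fetch the current definition so that only the
# extra default argument is added instead of replacing everything.
job = glue.get_job(JobName='my-glue-job')['Job']

default_args = job.get('DefaultArguments', {})
default_args['--additional-python-modules'] = 'botocore>=1.20.12,boto3>=1.17.12'

glue.update_job(
    JobName='my-glue-job',
    JobUpdate={
        'Role': job['Role'],
        'Command': job['Command'],
        'DefaultArguments': default_args,
    },
)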

KafkaRecord cannot be cast to [B

I'm trying to process the data streaming from Apache Kafka using the Python SDK for Apache Beam with the Flink runner. With Kafka 2.4.0 and Flink 1.8.3 running, I follow these steps:
1) Compile and run Beam 2.16 with Flink 1.8 runner.
git clone --single-branch --branch release-2.16.0 https://github.com/apache/beam.git beam-2.16.0
cd beam-2.16.0
nohup ./gradlew :runners:flink:1.8:job-server:runShadow -PflinkMasterUrl=localhost:8081 &
2) Run the Python pipeline.
from apache_beam import Pipeline
from apache_beam.io.external.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions
if __name__ == '__main__':
    with Pipeline(options=PipelineOptions([
        '--runner=FlinkRunner',
        '--flink_version=1.8',
        '--flink_master_url=localhost:8081',
        '--environment_type=LOOPBACK',
        '--streaming'
    ])) as pipeline:
        (
            pipeline
            | 'read' >> ReadFromKafka({'bootstrap.servers': 'localhost:9092'}, ['test'])  # [BEAM-3788] ???
        )
        result = pipeline.run()
        result.wait_until_finish()
3) Publish some data to Kafka.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
>{"hello":"world!"}
The Python script throws this error:
[flink-runner-job-invoker] ERROR org.apache.beam.runners.fnexecution.jobsubmission.JobInvocation - Error during job invocation BeamApp-USER-somejob. org.apache.flink.client.program.ProgramInvocationException: Job failed. (JobID: xxx)
at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:268)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:483)
at org.apache.beam.runners.flink.FlinkExecutionEnvironments$BeamFlinkRemoteStreamEnvironment.executeRemotely(FlinkExecutionEnvironments.java:360)
at org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.execute(RemoteStreamEnvironment.java:310)
at org.apache.beam.runners.flink.FlinkStreamingPortablePipelineTranslator$StreamingTranslationContext.execute(FlinkStreamingPortablePipelineTranslator.java:173)
at org.apache.beam.runners.flink.FlinkPipelineRunner.runPipelineWithTranslator(FlinkPipelineRunner.java:104)
at org.apache.beam.runners.flink.FlinkPipelineRunner.run(FlinkPipelineRunner.java:80)
at org.apache.beam.runners.fnexecution.jobsubmission.JobInvocation.runPipeline(JobInvocation.java:78)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:265)
... 13 more
Caused by: java.lang.ClassCastException: org.apache.beam.sdk.io.kafka.KafkaRecord cannot be cast to [B
at org.apache.beam.sdk.coders.ByteArrayCoder.encode(ByteArrayCoder.java:41)
at org.apache.beam.sdk.coders.LengthPrefixCoder.encode(LengthPrefixCoder.java:56)
at org.apache.beam.sdk.values.ValueWithRecordId$ValueWithRecordIdCoder.encode(ValueWithRecordId.java:105)
at org.apache.beam.sdk.values.ValueWithRecordId$ValueWithRecordIdCoder.encode(ValueWithRecordId.java:81)
at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:578)
at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:529)
at org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream(CoderUtils.java:82)
at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:66)
at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:51)
at org.apache.beam.sdk.util.CoderUtils.clone(CoderUtils.java:141)
at org.apache.beam.runners.flink.translation.types.CoderTypeSerializer.copy(CoderTypeSerializer.java:67)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:577)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:554)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:534)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:718)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:696)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:305)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:394)
at org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.emitElement(UnboundedSourceWrapper.java:341)
at org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:283)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)
at org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:302)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
... 1 more
ERROR:root:java.lang.ClassCastException: org.apache.beam.sdk.io.kafka.KafkaRecord cannot be cast to [B
[flink-runner-job-invoker] INFO org.apache.beam.runners.fnexecution.artifact.BeamFileSystemArtifactRetrievalService - Manifest at /tmp/artifacts0k1mnin0/somejob/MANIFEST has 0 artifact locations
[flink-runner-job-invoker] INFO org.apache.beam.runners.fnexecution.artifact.BeamFileSystemArtifactStagingService - Removed dir /tmp/artifacts0k1mnin0/job_somejob/
Traceback (most recent call last):
File "main.py", line 40, in <module>
run()
File "main.py", line 37, in run
result.wait_until_finish()
File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/runners/portability/portable_runner.py", line 439, in wait_until_finish self._job_id, self._state, self._last_error_message()))
RuntimeError: Pipeline BeamApp-USER-somejob failed in state FAILED: java.lang.ClassCastException: org.apache.beam.sdk.io.kafka.KafkaRecord cannot be cast to [B
I tried the other deserializers available in Kafka, but they did not work either: Couldn't infer Coder from class org.apache.kafka.common.serialization.StringDeserializer. That error originates from this piece of code.
Am I doing something wrong?
Disclaimer: this is my first encounter with the Apache Beam project.
According to this JIRA, Kafka consumer support is quite new in Beam (at least in the Python interface), and there still seems to be a problem with the FlinkRunner combined with this new API. Even though your code is technically correct, it will not run correctly on Flink. There is a patch available, which looks to me more like a quick fix than a final solution; it requires recompilation, so it is not something I would propose using in production. If you are just getting started with the technology and don't want to be blocked, feel free to try it out.

zeppelin pyspark how to connect remote spark?

My Zeppelin is using local Spark now.
I got ValueError: Cannot run multiple SparkContexts at once when I tried to create a remote SparkContext.
Following multiple SparkContexts error in tutorial, I wrote the code below:
from pyspark import SparkConf, SparkContext
sc.stop()
conf = SparkConf().setAppName('train_etl').setMaster('spark://xxxx:7077')
sc = SparkContext(conf=conf)
Got another error:
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-6681108227268089746.py", line 363, in <module>
sc.setJobGroup(jobGroup, jobDesc)
File "/usr/local/spark/python/lib/pyspark.zip/pyspark/context.py", line 944, in setJobGroup
self._jsc.setJobGroup(groupId, description, interruptOnCancel)
AttributeError: 'NoneType' object has no attribute 'setJobGroup'
What should I do?
By default, Spark automatically creates a SparkContext object named sc when the PySpark application starts. You have to use the line below in your code instead:
sc = SparkContext.getOrCreate()
getOrCreate() gets or instantiates a SparkContext and registers it as a singleton object, so the context Zeppelin has already started can be reused instead of creating a second one.
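A minimal sketch of how this looks in a Zeppelin paragraph (the master URL is the same placeholder as in the question):

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('train_etl').setMaster('spark://xxxx:7077')

# Reuses the SparkContext Zeppelin already created if one is running;
# otherwise a new context is created with this configuration.
sc = SparkContext.getOrCreate(conf=conf)

Note that if a context already exists, the new configuration is ignored, which is why setting the Spark interpreter's master to the remote cluster (as described below) is the more reliable approach.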
Open http://zeppelin_host:zeppelin_port/#/interpreter and set the master parameter of the Spark interpreter (which is also used for PySpark) to spark://xxxx:7077.

Kafka consumer: AttributeError: 'list' object has no attribute 'map'

I want to read messages from a Kafka queue in Python. In Scala, for example, it's quite easy to do:
val ssc = new StreamingContext(conf, Seconds(20))
// Divide the topic into partitions
val topicMessages = "myKafkaTopic"
val topicMessagesMap = topicMessages.split(",").map((_, kafkaNumThreads)).toMap
val messages = KafkaUtils.createStream(ssc, zkQuorum, group, topicMessagesMap).map(_._2)
messages.foreachRDD { rdd =>
//...
}
I want to do just the same in Python. This is my current Python code:
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
ssc = StreamingContext(sc, 20)
topicMessages = "myKafkaTopic"
topicMessagesMap = topicMessages.split(",").map((_, kafkaNumThreads)).toMap
messages = KafkaUtils.createStream(ssc, zkQuorum, group, topicMessagesMap)
However, I get an error at the line topicMessagesMap = topicMessages.split(",").map((_, kafkaNumThreads)).toMap:
AttributeError: 'list' object has no attribute 'map'
How can I make this code work?
UPDATE:
If I run this code in Jupyter Notebook, then I get the error shown below:
messages = KafkaUtils.createStream(ssc, zkQuorum, "spark-streaming-consumer", {inputKafkaTopic: list})
Spark Streaming's Kafka libraries not found in class path. Try one of the following.

1. Include the Kafka library and its dependencies with in the
   spark-submit command as

   $ bin/spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8:2.0.0 ...

2. Download the JAR of the artifact from Maven Central http://search.maven.org/,
   Group Id = org.apache.spark, Artifact Id = spark-streaming-kafka-0-8-assembly, Version = 2.0.0.
   Then, include the jar in the spark-submit command as

   $ bin/spark-submit --jars ...
Do I understand correctly that the only way to make this work is to use spark-submit, and that it's impossible to run this code from Jupyter/IPython?
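For reference, a minimal sketch of the Pythonic equivalent of the Scala topic map, assuming sc, zkQuorum, group, and kafkaNumThreads are defined as in the question: a Python list has no .map method, so a dict comprehension builds the {topic: numThreads} mapping that KafkaUtils.createStream expects, and the value extraction becomes a lambda.

from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

ssc = StreamingContext(sc, 20)

topicMessages = "myKafkaTopic"
# Equivalent of topicMessages.split(",").map((_, kafkaNumThreads)).toMap in Scala
topicMessagesMap = {topic: kafkaNumThreads for topic in topicMessages.split(",")}

messages = KafkaUtils.createStream(ssc, zkQuorum, group, topicMessagesMap)
# Equivalent of .map(_._2): keep only the message value from each (key, value) pair
lines = messages.map(lambda kv: kv[1])
lines.pprint()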

Pyspark - Using a custom function in map transform

I run the following file (called test_func.py) with py.test:
import findspark
findspark.init()
from pyspark.context import SparkContext
def filtering(data):
    return data.map(lambda p: modif(p)).count()

def modif(row):
    row.split(",")

class Test(object):
    sc = SparkContext('local[1]')

    def test_filtering(self):
        data = self.sc.parallelize(['1', '2', ''])
        assert filtering(data) == 2
And, because the modif function is used inside the map transform, it fails with the following error:
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/home/osboxes/spark-1.5.2-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/home/osboxes/spark-1.5.2-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/home/osboxes/spark-1.5.2-bin-hadoop2.4/python/lib/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
ImportError: No module named clustering.test_func
PySpark does not manage to find the modif function. Note that the file test_func.py is in the clustering directory, and I run py.test from inside that directory.
What astonishes me is that if I use the modif function outside a map, it works fine. For instance, modif(data.first()) works.
Any idea why I get such ImportErrors and how I could fix them?
EDIT
I had already tested what is proposed in Avihoo Mamka's answer, i.e. adding test_func.py to the files copied to all workers. However, it had no effect. That is no surprise to me, because I think the main file, from which the Spark application is created, is always sent to all workers.
I think it may come from the fact that PySpark is looking for clustering.test_func and not test_func.
The key here is the Traceback you got.
PySpark is telling you that the worker process doesn't have access to clustering.test_func.py. When you initialize the SparkContext you can pass a list of files that should be copied to the worker:
sc = SparkContext("local[1]", "App Name", pyFiles=['MyFile.py', 'lib.zip', 'app.egg'])
More information: https://spark.apache.org/docs/1.5.2/programming-guide.html
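If the context has already been created, the same effect can be achieved afterwards with addPyFile; a one-line sketch using the path from the question:

# Ship the module to the workers of an already-running context.
sc.addPyFile('clustering/test_func.py')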
