I'm trying to process data streaming from Apache Kafka using the Python SDK for Apache Beam with the Flink runner. With Kafka 2.4.0 and Flink 1.8.3 running, I follow these steps:
1) Compile and run Beam 2.16 with Flink 1.8 runner.
git clone --single-branch --branch release-2.16.0 https://github.com/apache/beam.git beam-2.16.0
cd beam-2.16.0
nohup ./gradlew :runners:flink:1.8:job-server:runShadow -PflinkMasterUrl=localhost:8081 &
2) Run the Python pipeline.
from apache_beam import Pipeline
from apache_beam.io.external.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions
if __name__ == '__main__':
    with Pipeline(options=PipelineOptions([
        '--runner=FlinkRunner',
        '--flink_version=1.8',
        '--flink_master_url=localhost:8081',
        '--environment_type=LOOPBACK',
        '--streaming'
    ])) as pipeline:
        (
            pipeline
            | 'read' >> ReadFromKafka({'bootstrap.servers': 'localhost:9092'}, ['test'])  # [BEAM-3788] ???
        )
        result = pipeline.run()
        result.wait_until_finish()
3) Publish some data to Kafka.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
>{"hello":"world!"}
The Python script throws this error:
[flink-runner-job-invoker] ERROR org.apache.beam.runners.fnexecution.jobsubmission.JobInvocation - Error during job invocation BeamApp-USER-somejob. org.apache.flink.client.program.ProgramInvocationException: Job failed. (JobID: xxx)
at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:268)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:483)
at org.apache.beam.runners.flink.FlinkExecutionEnvironments$BeamFlinkRemoteStreamEnvironment.executeRemotely(FlinkExecutionEnvironments.java:360)
at org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.execute(RemoteStreamEnvironment.java:310)
at org.apache.beam.runners.flink.FlinkStreamingPortablePipelineTranslator$StreamingTranslationContext.execute(FlinkStreamingPortablePipelineTranslator.java:173)
at org.apache.beam.runners.flink.FlinkPipelineRunner.runPipelineWithTranslator(FlinkPipelineRunner.java:104)
at org.apache.beam.runners.flink.FlinkPipelineRunner.run(FlinkPipelineRunner.java:80)
at org.apache.beam.runners.fnexecution.jobsubmission.JobInvocation.runPipeline(JobInvocation.java:78)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:265)
... 13 more
Caused by: java.lang.ClassCastException: org.apache.beam.sdk.io.kafka.KafkaRecord cannot be cast to [B
at org.apache.beam.sdk.coders.ByteArrayCoder.encode(ByteArrayCoder.java:41)
at org.apache.beam.sdk.coders.LengthPrefixCoder.encode(LengthPrefixCoder.java:56)
at org.apache.beam.sdk.values.ValueWithRecordId$ValueWithRecordIdCoder.encode(ValueWithRecordId.java:105)
at org.apache.beam.sdk.values.ValueWithRecordId$ValueWithRecordIdCoder.encode(ValueWithRecordId.java:81)
at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:578)
at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:529)
at org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream(CoderUtils.java:82)
at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:66)
at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:51)
at org.apache.beam.sdk.util.CoderUtils.clone(CoderUtils.java:141)
at org.apache.beam.runners.flink.translation.types.CoderTypeSerializer.copy(CoderTypeSerializer.java:67)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:577)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:554)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:534)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:718)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:696)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:305)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:394)
at org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.emitElement(UnboundedSourceWrapper.java:341)
at org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:283)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)
at org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:302)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
... 1 more
ERROR:root:java.lang.ClassCastException: org.apache.beam.sdk.io.kafka.KafkaRecord cannot be cast to [B
[flink-runner-job-invoker] INFO org.apache.beam.runners.fnexecution.artifact.BeamFileSystemArtifactRetrievalService - Manifest at /tmp/artifacts0k1mnin0/somejob/MANIFEST has 0 artifact locations
[flink-runner-job-invoker] INFO org.apache.beam.runners.fnexecution.artifact.BeamFileSystemArtifactStagingService - Removed dir /tmp/artifacts0k1mnin0/job_somejob/
Traceback (most recent call last):
File "main.py", line 40, in <module>
run()
File "main.py", line 37, in run
result.wait_until_finish()
File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/runners/portability/portable_runner.py", line 439, in wait_until_finish self._job_id, self._state, self._last_error_message()))
RuntimeError: Pipeline BeamApp-USER-somejob failed in state FAILED: java.lang.ClassCastException: org.apache.beam.sdk.io.kafka.KafkaRecord cannot be cast to [B
I tried the other deserializers available in Kafka, but they did not work either: Couldn't infer Coder from class org.apache.kafka.common.serialization.StringDeserializer. That error originates from this piece of code.
Am I doing something wrong?
Disclaimer: this is my first encounter with the Apache Beam project.
Kafka consumer support is quite a fresh thing in Beam (at least in the Python interface), according to this JIRA. Apparently there is still a problem with the FlinkRunner combined with this new API. Even though your code is technically correct, it will not run correctly on Flink. There is a patch available, which looks more like a quick fix than a final solution to me. It requires recompilation, so it is not something I would propose using in production. If you are just getting started with the technology and don't want to be blocked, feel free to try it out.
Related
I have a small Python3-script like this:
import speedtest
# Speedtest
test = speedtest.Speedtest() # <--- line 4
test.get_servers()
best = test.get_best_server()
print(f"Found: {best['host']} located in {best['country']}")
The first time I run it, it works and everything is fine; it outputs:
Found: speedtest.witcom.cloud:8080 located in Germany
Happy days.
The second time (and all subsequent times) that I run the script, I get this error:
Traceback (most recent call last):
File "/Users/zeth/Code/pinger/pinger.py", line 4, in <module>
test = speedtest.Speedtest()
File "/usr/local/lib/python3.9/site-packages/speedtest.py", line 1095, in __init__
self.get_config()
File "/usr/local/lib/python3.9/site-packages/speedtest.py", line 1127, in get_config
raise ConfigRetrievalError(e)
speedtest.ConfigRetrievalError: HTTP Error 403: Forbidden
When Googling around, I saw that I could also call this module straight from the command line, but just running this:
$ speedtest-cli
That gives me the same kind of error:
Retrieving speedtest.net configuration...
Cannot retrieve speedtest configuration
ERROR: HTTP Error 403: Forbidden
But if I run the CLI command with the flag, speedtest-cli --secure (docs for the --secure flag), then it goes through and outputs this:
Retrieving speedtest.net configuration...
Testing from Deutsche Telekom AG (212.185.228.168)...
Retrieving speedtest.net server list...
Selecting best server based on ping...
Hosted by hotspot.koeln (Cologne) [3.44 km]: 28.805 ms
Testing download speed................................................................................
Download: 30.01 Mbit/s
Testing upload speed......................................................................................................
Upload: 8.68 Mbit/s
The question
I can't figure out how to change this Python line, test = speedtest.Speedtest(), to use the --secure flag (or otherwise go through HTTPS).
The documentation for speedtest-cli is scarce.
Other attempts
I found this solution: Python Speedtest facing problems with certification _ssl.c:1056, which suggests manually approving the certificates.
But in that directory, /Volumes/Macintosh HD/Applications/, I don't have anything called Python3.9. I have Python 3.9 installed via Homebrew, and I'm on a Mac.
I could do this:
test = speedtest.Speedtest(secure=True)
I looked into the source code myself, in this file:
vim /usr/local/lib/python3.9/site-packages/speedtest.py
where I could see the constructor is defined like this:
class Speedtest(object):
    """Class for performing standard speedtest.net testing operations"""

    def __init__(self, config=None, source_address=None, timeout=10,
                 secure=False, shutdown_event=None):
        self.config = {}
        self._source_address = source_address
        self._timeout = timeout
        self._opener = build_opener(source_address, timeout)
        self._secure = secure
        ...
        ...
        ...
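Putting that together, the only change the original script needs is passing secure=True to the constructor. A minimal sketch (same server-selection flow as the original script):

import speedtest

# secure=True makes the client fetch the speedtest.net configuration
# and server list over HTTPS, mirroring the CLI's --secure flag
test = speedtest.Speedtest(secure=True)

test.get_servers()
best = test.get_best_server()

print(f"Found: {best['host']} located in {best['country']}")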
I am running the following Dataflow config:
test_dataflow = BeamRunPythonPipelineOperator(
    task_id="xxxx",
    runner="DataflowRunner",
    py_file=xxxxx,
    pipeline_options=dataflow_options,
    py_requirements=['apache-beam[gcp]==2.39.0'],
    py_interpreter='python3',
    dataflow_config=DataflowConfiguration(job_name="{{task.task_id}}", location=LOCATION, project_id=PROJECT, wait_until_finished=False, gcp_conn_id="google_cloud_default")
    # dataflow_config={"job_name": "{{task.task_id}}", "location": LOCATION, "project_id": PROJECT, "wait_until_finished": True, "gcp_conn_id": "google_cloud_default"}
)
It keeps throwing the error below. I am on Airflow 2.2.5.
Error - Traceback (most recent call last):
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/apache/beam/operators/beam.py", line 287, in execute
) = self._init_pipeline_options(format_pipeline_options=True, job_name_variable_key="job_name")
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/apache/beam/operators/beam.py", line 183, in _init_pipeline_options
dataflow_job_name, pipeline_options, process_line_callback = self._set_dataflow(
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/apache/beam/operators/beam.py", line 63, in _set_dataflow
pipeline_options = self.__get_dataflow_pipeline_options(
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/apache/beam/operators/beam.py", line 92, in __get_dataflow_pipeline_options
if self.dataflow_config.service_account:
AttributeError: 'DataflowConfiguration' object has no attribute 'service_account'
If I pass service_account, it errors out saying the parameter is invalid.
I ran into the same issue.
This is because of an inconsistency between the DataflowConfiguration shipped with the Dataflow provider and the one expected by the Beam operator: that version of DataflowConfiguration doesn't accept service_account.
I resolved my issue by upgrading Composer in place, so it picks up the latest Dataflow-related package, where this has been fixed.
The service_account attribute has been added in this commit https://github.com/apache/airflow/commit/de65a5cc5acaa1fc87ae8f65d367e101034294a6
If you can't upgrade Composer, try updating the Google providers package to the latest version, or at least to a version > 7.0.
You can check the commit in the commit log and identify the minimum version here - https://airflow.apache.org/docs/apache-airflow-providers-google/stable/commits.html#id6
Even though Composer uses its own fork, the OSS version should work. You can see the list of packages in the Composer version list at https://cloud.google.com/composer/docs/concepts/versioning/composer-versions; it says apache-airflow-providers-google==2022.5.18+composer instead of apache-airflow-providers-google==7.0.
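Once the provider package is new enough, the operator config from the question should be able to pass the service account through DataflowConfiguration. A hedged sketch, assuming the upgraded apache-airflow-providers-google contains the commit linked above (the service account e-mail and py_file path are placeholders; dataflow_options, LOCATION, and PROJECT are the same variables as in the question):

from airflow.providers.apache.beam.operators.beam import BeamRunPythonPipelineOperator
from airflow.providers.google.cloud.operators.dataflow import DataflowConfiguration

test_dataflow = BeamRunPythonPipelineOperator(
    task_id="xxxx",
    runner="DataflowRunner",
    py_file="gs://my-bucket/pipeline.py",  # placeholder
    pipeline_options=dataflow_options,
    py_requirements=['apache-beam[gcp]==2.39.0'],
    py_interpreter='python3',
    dataflow_config=DataflowConfiguration(
        job_name="{{task.task_id}}",
        location=LOCATION,
        project_id=PROJECT,
        wait_until_finished=False,
        gcp_conn_id="google_cloud_default",
        # only available once the provider contains the commit above
        service_account="my-sa@my-project.iam.gserviceaccount.com",  # placeholder
    ),
)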
I've been struggling to make this work. I know this is a cross-language transform and all of that, and I installed the Java JDK on my PC (when I run java -version on cmd I get the correct information and all of that), but when I try to make a simple pipeline work:
import apache_beam as beam
from apache_beam.io.external.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS']='credentialsOld.json'
def main():
    print('======================================================')
    beam_options = PipelineOptions(runner='DataflowRunner', temp_location=temp_location, staging_location=staging_location, project=project, experiments=['use_runner_v2'], streaming=True)
    with beam.Pipeline(options=beam_options) as p:
        msgs = p | 'ReadKafka' >> ReadFromKafka(consumer_config={'bootstrap.servers': 'xxxxx-xxxxx...', 'group_id': 'testAB'}, topics=['users'])
        msgs | beam.FlatMap(print)

if __name__ == '__main__':
    main()
I get this error: ValueError: Unsupported signal: 2
I have tried adding the parameter expansion_service= 'beam:external:java:kafka:read:v1' to the ReadFromKafka but then I get:
status = StatusCode.UNAVAILABLE
details = "DNS resolution failed for
beam:external:java:kafka:read:v1: UNKNOWN: OS Error"
I'm working in a Python venv environment, if that info is useful, and my Kafka cluster is on Confluent Cloud.
I'm also getting this runtime error:
RuntimeError: java.lang.RuntimeException: Failed to get dependencies of beam:transform:org.apache.beam:kafka_read_without_metadata:v1 from spec urn: "beam:transform:org.apache.beam:kafka_read_without_metadata:v1"
EDIT: I'm getting the bootstrap server option from here.
My mistake was that I was skipping the step where I have to start an expansion service. I did that with this command:
java -jar beam-sdks-java-io-expansion-service-2.37.0.jar 8088 --javaClassLookupAllowlistFile='*'
after downloading the beam-sdks-java-io-expansion-service-2.37.0.jar from https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-io-expansion-service/2.36.0
and then specifying the port in expansion_service='localhost:8088'
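For reference, a minimal sketch of what the ReadFromKafka call looks like inside the pipeline from the question once the expansion service is running locally on port 8088 (the bootstrap servers value is a placeholder):

from apache_beam.io.external.kafka import ReadFromKafka

# p is the beam.Pipeline from the question; the transform is expanded by
# the Java expansion service started with the jar above
msgs = p | 'ReadKafka' >> ReadFromKafka(
    consumer_config={'bootstrap.servers': 'xxxxx-xxxxx...'},  # placeholder
    topics=['users'],
    expansion_service='localhost:8088',
)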
Then I had two minor mistakes: one was that I was using JDK 18, which I think wasn't compatible (https://beam.apache.org/get-started/quickstart-java/), so I switched to JDK 17; the other was fixed by using Python 3.8 instead of Python 3.10.
I have the following code to run SQL transformations in Apache Beam with the direct runner on Windows.
import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform
with beam.Pipeline() as p:
    pipe = (
        p
        | 'hello' >> beam.Create([('SE', 400), ('SC', 500)])
        | 'schema' >> beam.Map(lambda x: beam.Row(
            state=x[0],
            population=x[1]
        ))
    )

    sql = (
        pipe
        | 'sql' >> SqlTransform('SELECT state, population FROM PCOLLECTION')
        | 'sql print' >> beam.Map(print)
    )
And I get the following error:
File "c:\users\XXX\appdata\local\programs\python\python37\lib\subprocess.py", line 1306, in send_signal
raise ValueError("Unsupported signal: {}".format(sig))
ValueError: Unsupported signal: 2
I saw experts on YouTube talking about the Universal Local Runner, but I didn't find how to install it.
Can anyone help me?
Thank you in advance.
The issue is now tracked in https://issues.apache.org/jira/browse/BEAM-12501.
I have no issue running your pipeline with source code built from Beam head, but I'm on macOS. What version of Beam did you use?
Could you try asking about it on user@beam.apache.org? This might be an issue specific to Windows.
I am trying to integrate Spark with Kafka. The Kafka consumer has JSON data, and I want to integrate the Kafka consumer with Spark for processing. When I run the code below, an error is thrown.
bin\spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.0.0 test.py localhost:9092 maktest
My test.py is below
import sys
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    print(lines)
    ssc.start()
    ssc.awaitTermination()
I got the error below
18/12/10 16:41:40 INFO VerifiableProperties: Verifying properties
18/12/10 16:41:40 INFO VerifiableProperties: Property group.id is overridden to
18/12/10 16:41:40 INFO VerifiableProperties: Property zookeeper.connect is overridden to
<pyspark.streaming.kafka.KafkaTransformedDStream object at 0x000002A6DA9FE6A0>
18/12/10 16:41:40 ERROR StreamingContext: Error starting the context, marking it as stopped
java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
at scala.Predef$.require(Predef.scala:224)
Traceback (most recent call last):
File "C:/Users/maws/Desktop/spark-2.2.1-bin-hadoop2.7/test.py", line 12, in <module>
ssc.start()
py4j.protocol.Py4JJavaError: An error occurred while calling o25.start.
: java.lang.IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute
You're not using a supported Spark Streaming DStream output operation.
For the pyspark API, you should use:
pprint()
saveAsTextFiles(prefix, [suffix])
saveAsObjectFiles(prefix, [suffix])
saveAsHadoopFiles(prefix, [suffix])
foreachRDD(func)
print() can't be used as a DStream output operation in PySpark, so when you adapt other Spark Streaming examples written for Scala or Java, make sure you change print() to pprint().
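For example, replacing print(lines) in the question's test.py with a registered output operation is enough to give the streaming context something to execute. A minimal sketch of the corrected script:

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

if __name__ == "__main__":
    sc = SparkContext(appName="PythonStreamingDirectKafkaWordCount")
    ssc = StreamingContext(sc, 2)
    brokers, topic = sys.argv[1:]
    kvs = KafkaUtils.createDirectStream(ssc, [topic], {"metadata.broker.list": brokers})
    lines = kvs.map(lambda x: x[1])
    # pprint() is a registered output operation, so ssc.start() has work to run
    lines.pprint()
    ssc.start()
    ssc.awaitTermination()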