ReadFromKafka with Python in Apache Beam: Unsupported signal: 2

I've been struggling to make this work. I know this is a cross-language transform, and I have installed the Java JDK on my PC (running java -version in cmd prints the expected information), but when I try to run a simple pipeline:
import apache_beam as beam
from apache_beam.io.external.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions
import os

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'credentialsOld.json'

def main():
    print('======================================================')
    beam_options = PipelineOptions(
        runner='DataflowRunner',
        temp_location=temp_location,
        staging_location=staging_location,
        project=project,
        experiments=['use_runner_v2'],
        streaming=True)
    with beam.Pipeline(options=beam_options) as p:
        msgs = p | 'ReadKafka' >> ReadFromKafka(
            consumer_config={'bootstrap.servers': 'xxxxx-xxxxx...',
                             'group_id': 'testAB'},
            topics=['users'])
        msgs | beam.FlatMap(print)

if __name__ == '__main__':
    main()
I get this error: ValueError: Unsupported signal: 2
I have tried adding the parameter expansion_service='beam:external:java:kafka:read:v1' to ReadFromKafka, but then I get:
status = StatusCode.UNAVAILABLE
details = "DNS resolution failed for
beam:external:java:kafka:read:v1: UNKNOWN: OS Error"
I'm working in a Python venv environment, in case this is useful, and my Kafka cluster is on Confluent Cloud.
I'm also getting this runtime error:
RuntimeError: java.lang.RuntimeException: Failed to get dependencies of beam:transform:org.apache.beam:kafka_read_without_metadata:v1 from spec urn: "beam:transform:org.apache.beam:kafka_read_without_metadata:v1"
EDIT: I'm getting the bootstrap server option from here

My mistake was that I was skipping the step where I have to start an expansion service. I did that with this command:
java -jar beam-sdks-java-io-expansion-service-2.37.0.jar 8088 --javaClassLookupAllowlistFile='*'
after downloading beam-sdks-java-io-expansion-service-2.37.0.jar from https://mvnrepository.com/artifact/org.apache.beam/beam-sdks-java-io-expansion-service/2.36.0
and then specifying the port in expansion_service='localhost:8088' (see the sketch below).
Then I had two minor mistakes: one was that I was using JDK 18, which I think wasn't compatible (https://beam.apache.org/get-started/quickstart-java/), so I switched to JDK 17; I also used Python 3.8 instead of Python 3.10.
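For reference, a minimal sketch of the resulting setup (my own illustration, not the poster's exact code): with the expansion service jar from above running on port 8088, ReadFromKafka is pointed at it through expansion_service. The bootstrap servers and topic are the placeholders from the question; runner, project and staging options are omitted here. Note that the Kafka consumer property is spelled 'group.id'.

import apache_beam as beam
from apache_beam.io.external.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

beam_options = PipelineOptions(streaming=True)  # runner/project/staging options omitted in this sketch

with beam.Pipeline(options=beam_options) as p:
    msgs = p | 'ReadKafka' >> ReadFromKafka(
        consumer_config={
            'bootstrap.servers': 'xxxxx-xxxxx...',  # Confluent Cloud bootstrap servers (placeholder)
            'group.id': 'testAB',                   # 'group.id' is the Kafka consumer property name
        },
        topics=['users'],
        expansion_service='localhost:8088')         # the manually started expansion service
    msgs | beam.Map(print)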

Related

Apache beam with SqlTransform in direct runner

I have the following code to run SQL transformations in Apache Beam with the direct runner on Windows.
import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform

with beam.Pipeline() as p:
    pipe = (
        p
        | 'hello' >> beam.Create([('SE', 400), ('SC', 500)])
        | 'schema' >> beam.Map(lambda x: beam.Row(
            state=x[0],
            population=x[1]
        ))
    )
    sql = (
        pipe
        | 'sql' >> SqlTransform('SELECT state, population FROM PCOLLECTION')
        | 'sql print' >> beam.Map(print)
    )
And I get the following error:
File "c:\users\XXX\appdata\local\programs\python\python37\lib\subprocess.py", line 1306, in send_signal
raise ValueError("Unsupported signal: {}".format(sig))
ValueError: Unsupported signal: 2
I saw experts on YouTube talking about the Universal Local Runner, but I didn't find how to install it.
Can anyone help me?
Thank you in advance.
The issue is now tracked in https://issues.apache.org/jira/browse/BEAM-12501.
I have no issue running your pipeline with source code built from Beam head, but I'm on macOS. What version of Beam did you use?
Could you try asking about it on user@beam.apache.org? This might be an issue specific to Windows.

KafkaRecord cannot be cast to [B

I'm trying to process the data streaming from Apache Kafka using the Python SDK for Apache Beam with the Flink runner. After running Kafka 2.4.0 and Flink 1.8.3, I follow these steps:
1) Compile and run Beam 2.16 with Flink 1.8 runner.
git clone --single-branch --branch release-2.16.0 https://github.com/apache/beam.git beam-2.16.0
cd beam-2.16.0
nohup ./gradlew :runners:flink:1.8:job-server:runShadow -PflinkMasterUrl=localhost:8081 &
2) Run the Python pipeline.
from apache_beam import Pipeline
from apache_beam.io.external.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

if __name__ == '__main__':
    with Pipeline(options=PipelineOptions([
        '--runner=FlinkRunner',
        '--flink_version=1.8',
        '--flink_master_url=localhost:8081',
        '--environment_type=LOOPBACK',
        '--streaming'
    ])) as pipeline:
        (
            pipeline
            | 'read' >> ReadFromKafka({'bootstrap.servers': 'localhost:9092'}, ['test'])  # [BEAM-3788] ???
        )
        result = pipeline.run()
        result.wait_until_finish()
3) Publish some data to Kafka.
bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
>{"hello":"world!"}
The Python script throws this error:
[flink-runner-job-invoker] ERROR org.apache.beam.runners.fnexecution.jobsubmission.JobInvocation - Error during job invocation BeamApp-USER-somejob. org.apache.flink.client.program.ProgramInvocationException: Job failed. (JobID: xxx)
at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:268)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:483)
at org.apache.beam.runners.flink.FlinkExecutionEnvironments$BeamFlinkRemoteStreamEnvironment.executeRemotely(FlinkExecutionEnvironments.java:360)
at org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.execute(RemoteStreamEnvironment.java:310)
at org.apache.beam.runners.flink.FlinkStreamingPortablePipelineTranslator$StreamingTranslationContext.execute(FlinkStreamingPortablePipelineTranslator.java:173)
at org.apache.beam.runners.flink.FlinkPipelineRunner.runPipelineWithTranslator(FlinkPipelineRunner.java:104)
at org.apache.beam.runners.flink.FlinkPipelineRunner.run(FlinkPipelineRunner.java:80)
at org.apache.beam.runners.fnexecution.jobsubmission.JobInvocation.runPipeline(JobInvocation.java:78)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:57)
at org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:265)
... 13 more
Caused by: java.lang.ClassCastException: org.apache.beam.sdk.io.kafka.KafkaRecord cannot be cast to [B
at org.apache.beam.sdk.coders.ByteArrayCoder.encode(ByteArrayCoder.java:41)
at org.apache.beam.sdk.coders.LengthPrefixCoder.encode(LengthPrefixCoder.java:56)
at org.apache.beam.sdk.values.ValueWithRecordId$ValueWithRecordIdCoder.encode(ValueWithRecordId.java:105)
at org.apache.beam.sdk.values.ValueWithRecordId$ValueWithRecordIdCoder.encode(ValueWithRecordId.java:81)
at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:578)
at org.apache.beam.sdk.util.WindowedValue$FullWindowedValueCoder.encode(WindowedValue.java:529)
at org.apache.beam.sdk.util.CoderUtils.encodeToSafeStream(CoderUtils.java:82)
at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:66)
at org.apache.beam.sdk.util.CoderUtils.encodeToByteArray(CoderUtils.java:51)
at org.apache.beam.sdk.util.CoderUtils.clone(CoderUtils.java:141)
at org.apache.beam.runners.flink.translation.types.CoderTypeSerializer.copy(CoderTypeSerializer.java:67)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.pushToOperator(OperatorChain.java:577)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:554)
at org.apache.flink.streaming.runtime.tasks.OperatorChain$CopyingChainingOutput.collect(OperatorChain.java:534)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:718)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.collect(AbstractStreamOperator.java:696)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:305)
at org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:394)
at org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.emitElement(UnboundedSourceWrapper.java:341)
at org.apache.beam.runners.flink.translation.wrappers.streaming.io.UnboundedSourceWrapper.run(UnboundedSourceWrapper.java:283)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:93)
at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:57)
at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:97)
at org.apache.flink.streaming.runtime.tasks.StoppableSourceStreamTask.run(StoppableSourceStreamTask.java:45)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:302)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
... 1 more
ERROR:root:java.lang.ClassCastException: org.apache.beam.sdk.io.kafka.KafkaRecord cannot be cast to [B
[flink-runner-job-invoker] INFO org.apache.beam.runners.fnexecution.artifact.BeamFileSystemArtifactRetrievalService - Manifest at /tmp/artifacts0k1mnin0/somejob/MANIFEST has 0 artifact locations
[flink-runner-job-invoker] INFO org.apache.beam.runners.fnexecution.artifact.BeamFileSystemArtifactStagingService - Removed dir /tmp/artifacts0k1mnin0/job_somejob/
Traceback (most recent call last):
File "main.py", line 40, in <module>
run()
File "main.py", line 37, in run
result.wait_until_finish()
File "/home/USER/beam/lib/python3.5/site-packages/apache_beam/runners/portability/portable_runner.py", line 439, in wait_until_finish self._job_id, self._state, self._last_error_message()))
RuntimeError: Pipeline BeamApp-USER-somejob failed in state FAILED: java.lang.ClassCastException: org.apache.beam.sdk.io.kafka.KafkaRecord cannot be cast to [B
I tried other deserializers available in Kafka, but they did not work: Couldn't infer Coder from class org.apache.kafka.common.serialization.StringDeserializer. This error originates from this piece of code.
Am I doing something wrong?
Disclaimer: this is my first encounter with the Apache Beam project.
It seems that Kafka consumer support is quite a fresh thing in Beam (at least in the Python interface), according to this JIRA. Apparently there is still a problem with the FlinkRunner combined with this new API. Even though your code is technically correct, it will not run correctly on Flink. There is a patch available, which seems more like a quick fix than a final solution to me. It requires recompilation and thus is not something I would propose using in production. If you are just getting started with the technology and don't want to be blocked, feel free to try it out.
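For completeness, a minimal sketch (my own, not from the answer) of how the same pipeline is typically written against a newer Beam release, where the cross-language transform lives in apache_beam.io.kafka and the coder problem above no longer applies. The Flink version string and the --flink_master flag are assumptions that depend on the Beam and Flink versions actually in use.

import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

if __name__ == '__main__':
    options = PipelineOptions([
        '--runner=FlinkRunner',
        '--flink_version=1.13',           # must match the running Flink cluster
        '--flink_master=localhost:8081',  # newer Beam releases use --flink_master
        '--environment_type=LOOPBACK',
        '--streaming',
    ])
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | 'read' >> ReadFromKafka({'bootstrap.servers': 'localhost:9092'}, ['test'])
            # ReadFromKafka yields (key_bytes, value_bytes) tuples by default
            | 'decode' >> beam.Map(lambda kv: kv[1].decode('utf-8') if kv[1] else None)
            | 'print' >> beam.Map(print)
        )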

Spark Release 2.4.0 cannot work on Windows 7 because of ImportError: No module named 'resource'

I attempted to install Spark Release 2.4.0 on my PC, which runs win7_x64.
However, when I try to run simple code to check whether Spark is ready to work:
code:
import os
from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster('local[*]').setAppName('word_count')
sc = SparkContext(conf=conf)
d = ['a b c d', 'b c d e', 'c d e f']
d_rdd = sc.parallelize(d)
rdd_res = d_rdd.flatMap(lambda x: x.split(' ')).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a+b)
print(rdd_res)
print(rdd_res.collect())
I get this error:
[screenshot: error1]
I open the worker.py file to check the code.
I find that in version 2.4.0 the code is:
[screenshot: worker.py v2.4.0]
However, in version 2.3.2, the code is:
[screenshot: worker.py v2.3.2]
Then I reinstalled spark-2.3.2-bin-hadoop2.7 and the code works fine.
Also, I found this question:
ImportError: No module named 'resource'
So I think spark-2.4.0-bin-hadoop2.7 cannot work on Windows 7 because worker.py imports the resource module, which is a Unix-specific package.
I hope someone can fix this problem in Spark.
I got this error with Spark 2.4.0, JDK 11, and Kafka 2.11 on Windows.
I was able to resolve it by doing the following:
1) cd spark_home\python\lib
e.g. cd C:\myprograms\spark-2.4.0-bin-hadoop2.7\python\lib
2) unzip pyspark.zip
3) edit worker.py, comment out 'import resource' and also the following paragraph, then save the file. This paragraph is just an add-on, not core code, so it is fine to comment it out.
4) remove the older pyspark.zip and create a new zip (see the sketch after the commented-out block below).
5) in a Jupyter notebook, restart the kernel.
The commented-out paragraph in worker.py:
# set up memory limits
# memory_limit_mb = int(os.environ.get('PYSPARK_EXECUTOR_MEMORY_MB', "-1"))
# total_memory = resource.RLIMIT_AS
# try:
#     if memory_limit_mb > 0:
#         (soft_limit, hard_limit) = resource.getrlimit(total_memory)
#         msg = "Current mem limits: {0} of max {1}\n".format(soft_limit, hard_limit)
#         print(msg, file=sys.stderr)
#
#         # convert to bytes
#         new_limit = memory_limit_mb * 1024 * 1024
#
#         if soft_limit == resource.RLIM_INFINITY or new_limit < soft_limit:
#             msg = "Setting mem limits to {0} of max {1}\n".format(new_limit, new_limit)
#             print(msg, file=sys.stderr)
#             resource.setrlimit(total_memory, (new_limit, new_limit))
# except (resource.error, OSError, ValueError) as e:
#     # not all systems support resource limits, so warn instead of failing
#     print("WARN: Failed to set memory limit: {0}\n".format(e), file=sys.stderr)
Python has a compatibility issue with the newly released Spark 2.4.0. I also faced this issue. If you download and configure Spark 2.3.2 on your system (and change the environment variables), the problem will be resolved.

Executable out of script containing serial_for_url

I have developed a Python script for serial communication with a digital pump. I now need to make an executable out of it. However, even though it works perfectly well when run with Python, and py2exe does produce the .exe properly, when I run the executable the following error occurs:
File: pump_model.pyc in line 96 in connect_new
File: serial\__init__.pyc in line 71 in serial_for_url
ValueError: invalid URL protocol 'loop' not known
The relevant piece of my code is the following:
# New serial connection
def connect_new(self, port_name):
    """Function for configuring a new serial connection."""
    try:
        self.ser = serial.Serial(port=port_name,
                                 baudrate=9600,
                                 parity='N',
                                 stopbits=1,
                                 bytesize=8,
                                 timeout=self.timeout_time)
    except serial.SerialException:
        self.ser = serial.serial_for_url('loop://',
                                         timeout=self.timeout_time)  # This line BLOWS!
    except:
        print sys.exc_info()[0]
    finally:
        self.initialize_pump()
I should note that the application was written on OS X and was tested on Windows with the Canopy Python distribution.
I had the exact same problem with "socket://" rather than "loop://".
I wasn't able to get the accepted answer to work; however, the following seems to succeed:
1) Add an explicit import of the offending urlhandler.* module:
import serial
# explicit import for py2exe - to fix "socket://" url issue
import serial.urlhandler.protocol_socket
# explicit import for py2exe - to fix "loop://" url issue (OP's particular prob)
import serial.urlhandler.protocol_loop
# use serial_for_url in normal manner
self._serial = serial.serial_for_url('socket://192.168.1.99:12000')
2) Generate a setup script for py2exe (see https://pypi.python.org/pypi/py2exe/) -- I've installed py2exe in a virtualenv:
path\to\env\Scripts\python.exe -m py2exe myscript.py -W mysetup.py
3) Edit mysetup.py to include the option
zipfile="library.zip" # default generated value is None
(see also http://www.py2exe.org/index.cgi/ListOfOptions); a sketch of the edited file follows the steps below.
4) Build it:
path\to\env\Scripts\python.exe mysetup.py py2exe
5) Run it:
dist\myscript.exe
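The generated mysetup.py differs between py2exe versions, but a hypothetical sketch of what it might contain after the edit in step 3 (the zipfile option, plus includes that have roughly the same effect as the explicit imports from step 1) is:

# mysetup.py - a sketch only; the real generated file will look different
from distutils.core import setup
import py2exe  # registers the py2exe command

setup(
    console=['myscript.py'],
    zipfile='library.zip',  # step 3: default generated value is None
    options={'py2exe': {
        # make sure the url handler modules end up in the bundle
        'includes': ['serial.urlhandler.protocol_loop',
                     'serial.urlhandler.protocol_socket'],
    }},
)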
Found it!
It seems that for some reason the 'loop://' argument can't be recognised after the .exe is produced.
By studying pyserial's __init__.py I figured out that when issuing the command serial.serial_for_url('loop://') you essentially call:
sys.modules['serial.urlhandler.protocol_loop'].Serial('loop://')
So you have to first import serial.urlhandler.protocol_loop
and then issue that command in place of the malfunctioning one.
So you can now type:
__import__('serial.urlhandler.protocol_loop')
sys.modules['serial.urlhandler.protocol_loop'].Serial('loop://')
After this minor workaround it worked fine.
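Putting the two answers together for the original 'loop://' case, a minimal sketch of the fix: import the url handler module explicitly (so py2exe bundles it) and keep using serial_for_url as before. The timeout value here is just illustrative.

import serial
import serial.urlhandler.protocol_loop  # explicit import so py2exe picks the handler up

ser = serial.serial_for_url('loop://', timeout=1)
ser.write(b'hello')
print(ser.read(5))  # loop:// echoes written bytes back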

Pig streaming through python script with import modules

Working with:
pigtmp$ pig --version
Apache Pig version 0.8.1-cdh3u1 (rexported)
compiled Jul 18 2011, 08:29:40
I have a Python script (CPython) which imports another script, both very simple in my example:
DATA
example$ hadoop fs -cat /user/pavel/trivial.log
1 one
2 two
3 three
EXAMPLE WITHOUT INCLUDE - works fine
example$ pig -f trivial_stream.pig
(1,1,one)
()
(1,2,two)
()
(1,3,three)
()
where
1) trivial_stream.pig:
DEFINE test_stream `test_stream.py` SHIP ('test_stream.py');
A = LOAD 'trivial.log' USING PigStorage('\t') AS (mynum: int, mynumstr: chararray);
C = STREAM A THROUGH test_stream;
DUMP C;
2) test_stream.py
#! /usr/bin/env python
import sys
import string
for line in sys.stdin:
    if len(line) == 0: continue
    new_line = line
    print "%d\t%s" % (1, new_line)
So essentially I just aggregate lines with one key, nothing special.
EXAMPLE WITH INCLUDE - bombs!
Now I'd like to append a string from a Python module that sits in the same directory as test_stream.py. I've tried to ship the imported module in many different ways but get the same error (see below).
1) trivial_stream.pig:
DEFINE test_stream `test_stream.py` SHIP ('test_stream.py', 'test_import.py');
A = LOAD 'trivial.log' USING PigStorage('\t') AS (mynum: int, mynumstr: chararray);
C = STREAM A THROUGH test_stream;
DUMP C;
2) test_stream.py
#! /usr/bin/env python
import sys
import string
import test_import
for line in sys.stdin:
    if len(line) == 0: continue
    new_line = ("%s-%s") % (line.strip(), test_import.getTestLine())
    print "%d\t%s" % (1, new_line)
3) test_import.py
def getTestLine():
    return "test line"
Now
example$ pig -f trivial_stream.pig
Backend error message
org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:265)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.cleanup(PigMapBase.java:103)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
at org.apache.hadoop.mapred.Child.main(Child.java:264)
Pig Stack Trace
ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1
org.apache.pig.impl.logicalLayer.FrontendException: ERROR 1066: Unable to open iterator for alias C. Backend error : Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1
at org.apache.pig.PigServer.openIterator(PigServer.java:753)
at org.apache.pig.tools.grunt.GruntParser.processDump(GruntParser.java:615)
at org.apache.pig.tools.pigscript.parser.PigScriptParser.parse(PigScriptParser.java:303)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:168)
at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:144)
at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:90)
at org.apache.pig.Main.run(Main.java:396)
at org.apache.pig.Main.main(Main.java:107)
Caused by: org.apache.pig.backend.executionengine.ExecException: ERROR 2997: Unable to recreate exception from backed error: org.apache.pig.backend.executionengine.ExecException: ERROR 2055: Received Error while processing the map plan: 'test_stream.py ' failed with exit status: 1
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getErrorMessages(Launcher.java:221)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.Launcher.getStats(Launcher.java:151)
at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher.launchPig(MapReduceLauncher.java:337)
at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.execute(HExecutionEngine.java:382)
at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1209)
at org.apache.pig.PigServer.storeEx(PigServer.java:885)
at org.apache.pig.PigServer.store(PigServer.java:827)
at org.apache.pig.PigServer.openIterator(PigServer.java:739)
... 7 more
Thank you very much for your help!
-Pavel
Correct answer, from the comment above:
The dependencies aren't shipped. If you want your Python app to work with Pig, you need to tar it (don't forget the __init__.py's!), then include the .tar file in Pig's SHIP statement. The first thing the script does is untar the app. There might be issues with paths, so I'd suggest the following even before tar extraction: sys.path.insert(0, os.getcwd()).
You need to append the current directory to sys.path in your test_stream.py:
#! /usr/bin/env python
import sys
sys.path.append(".")
Thus the SHIP clause you had does ship the Python script; you just need to tell Python where to look. A patched test_stream.py might look like the sketch below.
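A sketch (mine) of test_stream.py with that fix applied to the original script, keeping the question's Python 2 style; for a larger app you would additionally ship a tar of the package and extract it here before the import, as described in the accepted comment above.

#! /usr/bin/env python
import os
import sys

sys.path.insert(0, os.getcwd())  # shipped files land in the task's working directory

import test_import

for line in sys.stdin:
    line = line.strip()
    if len(line) == 0:
        continue
    new_line = "%s-%s" % (line, test_import.getTestLine())
    print "%d\t%s" % (1, new_line)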
