I'm currently trying to get the hang of Apache Beam together with Apache Kafka.
The Kafka service is running locally, and I'm writing some test messages with kafka-console-producer.
First I wrote this Java code snippet to test Apache Beam in a language I know, and it works as expected.
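import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.io.kafka.KafkaRecord;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class Main {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create();
        KafkaIO.Read<Long, String> kafkaReader = KafkaIO.<Long, String>read()
            .withBootstrapServers("localhost:9092")
            .withTopic("beam-test")
            .withKeyDeserializer(LongDeserializer.class)
            .withValueDeserializer(StringDeserializer.class);
        // withoutMetadata() returns a new transform; its result is discarded here,
        // so the pipeline below still receives KafkaRecord elements.
        kafkaReader.withoutMetadata();
        pipeline
            .apply("Kafka", kafkaReader)
            .apply("Extract words", ParDo.of(new DoFn<KafkaRecord<Long, String>, String>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                    System.out.println("Key:" + c.element().getKV().getKey() + " | Value: " + c.element().getKV().getValue());
                }
            }));
        pipeline.run();
    }
}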
My goal is to write the same thing in Python, and this is what I have so far:
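import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions

def run_pipe():
    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | 'Kafka Unbounded' >> ReadFromKafka(consumer_config={'bootstrap.servers': 'localhost:9092'}, topics=['beam-test'])
         | 'Test Print' >> beam.Map(print)
        )

if __name__ == '__main__':
    run_pipe()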
Now to the problem. When I try to run the python code, I get the following error:
(app) λ python ArghKafkaExample.py
Traceback (most recent call last):
File "ArghKafkaExample.py", line 22, in <module>
run_pipe()
File "ArghKafkaExample.py", line 10, in run_pipe
(p
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\transforms\ptransform.py", line 1028, in __ror__
return self.transform.__ror__(pvalueish, self.label)
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\transforms\ptransform.py", line 572, in __ror__
result = p.apply(self, pvalueish, label)
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\pipeline.py", line 648, in apply
return self.apply(transform, pvalueish)
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\pipeline.py", line 691, in apply
pvalueish_result = self.runner.apply(transform, pvalueish, self._options)
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\runners\runner.py", line 198, in apply
return m(transform, input, options)
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\runners\runner.py", line 228, in apply_PTransform
return transform.expand(input)
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\transforms\external.py", line 322, in expand
self._expanded_components = self._resolve_artifacts(
File "C:\Users\gamef\AppData\Local\Programs\Python\Python38\lib\contextlib.py", line 120, in __exit__
next(self.gen)
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\transforms\external.py", line 372, in _service
yield stub
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\transforms\external.py", line 523, in __exit__
self._service_provider.__exit__(*args)
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\utils\subprocess_server.py", line 74, in __exit__
self.stop()
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\utils\subprocess_server.py", line 133, in stop
self.stop_process()
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\utils\subprocess_server.py", line 179, in stop_process
return super(JavaJarServer, self).stop_process()
File "C:\Users\gamef\git\BeamMeScotty\app\lib\site-packages\apache_beam\utils\subprocess_server.py", line 143, in stop_process
self._process.send_signal(signal.SIGINT)
File "C:\Users\gamef\AppData\Local\Programs\Python\Python38\lib\subprocess.py", line 1434, in send_signal
raise ValueError("Unsupported signal: {}".format(sig))
ValueError: Unsupported signal: 2
From googling, I found out that it has something to do with program exit codes (like Ctrl+C), but overall I have absolutely no idea what the problem is.
Any advice would be helpful!
Greetings Pascal
Your pipeline code seems correct here. The issue is due to the requirements of the Kafka IO in the Python SDK. From the module documentation:
These transforms are currently supported by Beam portable runners (for example, portable Flink and Spark) as well as Dataflow runner.
Transforms provided in this module are cross-language transforms implemented in the Beam Java SDK. During the pipeline construction, Python SDK will connect to a Java expansion service to expand these transforms. To facilitate this, a small amount of setup is needed before using these transforms in a Beam Python pipeline.
In the Python SDK, Kafka IO is implemented as a cross-language transform backed by Java, and your pipeline is failing because you haven't set up your environment to execute cross-language transforms. In layman's terms, a cross-language transform means that the Kafka transform actually executes on the Java SDK rather than the Python SDK, so it can reuse the existing Kafka code written in Java.
There are two barriers preventing your pipeline from working. The easier one to fix is that only the runners I quoted above support cross-language transforms, so this pipeline won't work on the Direct runner. You'll want to switch to either the Flink or Spark runner in local mode.
The trickier barrier is that you need to start up an expansion service to be able to add external transforms to your pipeline. The stack trace you're getting occurs because Beam attempts to expand the transform but is unable to connect to the expansion service, so the expansion fails.
If you still want to try running this with cross-language despite the extra setup, the documentation I linked contains instructions for running an expansion service. At the time I am writing this answer this feature is still new, and there might be blind spots in the documentation. If you run into problems, I encourage you to ask questions on the Apache Beam users mailing list or Apache Beam slack channel.
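As a side note on the final line of the traceback: its last frames show Beam stopping the expansion-service subprocess with signal.SIGINT. On Windows, subprocess.Popen.send_signal only accepts SIGTERM, CTRL_C_EVENT and CTRL_BREAK_EVENT, which is where "Unsupported signal: 2" comes from. The following is only a sketch of a platform-aware way to stop a child process; stop_process is a hypothetical helper and not part of Beam, and it does not replace the expansion-service setup described above.

```python
import signal
import subprocess
import sys


def stop_process(proc: subprocess.Popen, timeout: float = 10.0) -> int:
    """Stop a child process and return its exit code (hypothetical helper)."""
    if proc.poll() is not None:
        return proc.returncode  # the child has already exited
    if sys.platform == "win32":
        # Popen.send_signal(signal.SIGINT) raises ValueError on Windows,
        # so fall back to terminate() there.
        proc.terminate()
    else:
        proc.send_signal(signal.SIGINT)
    try:
        return proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # The child ignored the polite request; force it to stop.
        proc.kill()
        return proc.wait()


if __name__ == "__main__":
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print("exit code:", stop_process(child))
```

The exact exit code differs between platforms, so the sketch only reports it rather than relying on a particular value.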
In this document, Apache Beam suggests the dead-letter pattern when writing to BigQuery. This pattern allows you to fetch the rows that failed to be written from the transform output, using the 'FailedRows' tag.
However, when I try to use it:
WriteToBigQuery(
    table=self.bigquery_table_name,
    schema={"fields": self.bigquery_table_schema},
    method=WriteToBigQuery.Method.FILE_LOADS,
    temp_file_format=FileFormat.AVRO,
)
A schema mismatch in one of my elements causes the following exception:
Error message from worker: Traceback (most recent call last):
  File "/my_code/apache_beam/io/gcp/bigquery_tools.py", line 1630, in write
    self._avro_writer.write(row)
  File "fastavro/_write.pyx", line 647, in fastavro._write.Writer.write
  File "fastavro/_write.pyx", line 376, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 320, in fastavro._write.write_record
  File "fastavro/_write.pyx", line 374, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 283, in fastavro._write.write_union
ValueError: [] (type <class 'list'>) do not match ['null', 'double'] on field safety_proxy

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "apache_beam/runners/common.py", line 1198, in apache_beam.runners.common.DoFnRunner.process
  File "apache_beam/runners/common.py", line 718, in apache_beam.runners.common.PerWindowInvoker.invoke_process
  File "apache_beam/runners/common.py", line 841, in apache_beam.runners.common.PerWindowInvoker._invoke_process_per_window
  File "apache_beam/runners/common.py", line 1334, in apache_beam.runners.common._OutputProcessor.process_outputs
  File "/my_code/apache_beam/io/gcp/bigquery_file_loads.py", line 258, in process
    writer.write(row)
  File "/my_code/apache_beam/io/gcp/bigquery_tools.py", line 1635, in write
    ex, self._avro_writer.schema, row)).with_traceback(tb)
  File "/my_code/apache_beam/io/gcp/bigquery_tools.py", line 1630, in write
    self._avro_writer.write(row)
  File "fastavro/_write.pyx", line 647, in fastavro._write.Writer.write
  File "fastavro/_write.pyx", line 376, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 320, in fastavro._write.write_record
  File "fastavro/_write.pyx", line 374, in fastavro._write.write_data
  File "fastavro/_write.pyx", line 283, in fastavro._write.write_union
ValueError: Error writing row to Avro: [] (type <class 'list'>) do not match ['null', 'double'] on field safety_proxy Schema: ...
From what I gather, the schema mismatch causes fastavro._write.Writer.write to fail and throw an exception. Instead, I would like WriteToBigQuery to apply the dead-letter behavior and return my malformed rows as FailedRows tagged output. Is there a way to achieve this?
Thanks
EDIT: Adding more detailed example of what I'm trying to do:
from apache_beam import Create
from apache_beam.io.gcp.bigquery import BigQueryWriteFn, WriteToBigQuery
from apache_beam.io.textio import WriteToText
...
valid_rows = [{"some_field_name": i} for i in range(1000000)]
invalid_rows = [{"wrong_field_name": i}]
pcoll = Create(valid_rows + invalid_rows)

# This fails because of the 1 invalid row
write_result = (
    pcoll
    | WriteToBigQuery(
        table=self.bigquery_table_name,
        schema={
            "fields": [
                {'name': 'some_field_name', 'type': 'INTEGER', 'mode': 'NULLABLE'},
            ]
        },
        method=WriteToBigQuery.Method.FILE_LOADS,
        temp_file_format=FileFormat.AVRO,
    )
)

# What I want is for WriteToBigQuery to partially succeed and output the failed rows.
# This is because I have pipelines that run for multiple hours and fail because of
# a small amount of malformed rows
(
    write_result[BigQueryWriteFn.FAILED_ROWS]
    | WriteToText('gs://my_failed_rows/')
)
You can use a dead letter queue in the pipeline instead of letting BigQuery catch errors for you.
Beam offers a native way to do error handling and dead letter queues with TupleTags, but the code is a little verbose.
I created an open source library called Asgarde, for both the Python SDK and the Java SDK, that applies error handling with less code and in a more concise and expressive way:
https://github.com/tosun-si/pasgarde
(the Java version: https://github.com/tosun-si/asgarde)
You can install it with pip:
asgarde==0.16.0
pip install asgarde==0.16.0
from typing import Dict

import apache_beam as beam
from apache_beam import Create
from apache_beam.io.gcp.bigquery import BigQueryWriteFn, WriteToBigQuery
from apache_beam.io.textio import WriteToText
from asgarde.collection_composer import CollectionComposer

def validate_row(self, row) -> Dict:
    field = row['your_field']
    if field is None or field == '':
        # You can raise your own custom exception
        raise ValueError('Bad field')
    # Return the row so that valid elements reach the output PCollection
    return row

...
valid_rows = [{"some_field_name": i} for i in range(1000000)]
invalid_rows = [{"wrong_field_name": i}]
pcoll = Create(valid_rows + invalid_rows)

# Dead letter queue provided by Asgarde: it returns an output PCollection and a Failure PCollection.
output_pcoll, failure_pcoll = (CollectionComposer.of(pcoll)
                               .map(self.validate_row))

# Good sink
(
    output_pcoll
    | WriteToBigQuery(
        table=self.bigquery_table_name,
        schema={
            "fields": [
                {'name': 'some_field_name', 'type': 'INTEGER', 'mode': 'NULLABLE'},
            ]
        },
        method=WriteToBigQuery.Method.FILE_LOADS
    )
)

# Bad sink: PCollection[Failure] / Failure contains inputElement and
# stackTrace.
(
    failure_pcoll
    | beam.Map(lambda failure: self.your_failure_transformation(failure))
    | WriteToBigQuery(
        table=self.bigquery_table_name,
        schema=your_schema_for_failure_table,
        method=WriteToBigQuery.Method.FILE_LOADS
    )
)
The structure of the Failure object provided by the Asgarde library:
@dataclass
class Failure:
    pipeline_step: str
    input_element: str
    exception: Exception
In the validate_row function, you apply your validation logic and detect bad fields.
When a field is bad, you raise an exception, and Asgarde catches the error for you.
The result of the CollectionComposer flow is:
a PCollection of outputs, which in this case I think is a PCollection[Dict]
a PCollection[Failure]
At the end you can write to multiple sinks:
Write good outputs to BigQuery
Write failures to BigQuery
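To make the idea concrete without Beam or Asgarde, here is a minimal plain-Python sketch of the same dead-letter split. It is not the Asgarde implementation: split_with_dead_letter and the validation rule are hypothetical, and the Failure class simply mirrors the fields listed above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


@dataclass
class Failure:
    pipeline_step: str
    input_element: str
    exception: Exception


def validate_row(row: Row) -> Row:
    # Hypothetical rule: the row must carry an integer 'some_field_name'.
    if not isinstance(row.get("some_field_name"), int):
        raise ValueError("Bad field")
    return row


def split_with_dead_letter(
    rows: Iterable[Row], step: str, fn: Callable[[Row], Row]
) -> Tuple[List[Row], List[Failure]]:
    """Apply fn to every row; collect results and failures separately."""
    outputs: List[Row] = []
    failures: List[Failure] = []
    for row in rows:
        try:
            outputs.append(fn(row))
        except Exception as exc:  # every error is routed to the dead letter list
            failures.append(Failure(step, str(row), exc))
    return outputs, failures


good, bad = split_with_dead_letter(
    [{"some_field_name": 1}, {"wrong_field_name": 2}], "validate", validate_row
)
```

Here good would go to the main sink and bad to the failure sink, so one malformed row no longer stops the rest from being written.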
You can also apply the same logic with native Beam error handling and TupleTags. I posted an example in a project in my GitHub repository:
https://github.com/tosun-si/teams-league-python-dlq-native-beam-summit/blob/main/team_league/domain_ptransform/team_stats_transform.py
Let's step back slightly and look at the aims and desired outcomes.
Why is "FILE_LOADS" required as the BigQuery write method?
https://beam.apache.org/releases/javadoc/2.29.0/org/apache/beam/sdk/io/gcp/bigquery/BigQueryIO.Write.Method.html
Are you also aware of the BigQuery Storage Write API? https://cloud.google.com/bigquery/docs/write-api
It looks like the Java SDK supports the BigQuery Storage Write API, but the Python SDK currently does not. I believe using the Write API would connect over gRPC to write into BigQuery, rather than needing to serialize to Avro and then call the (legacy) batch load process.
Perhaps take a look and see if that helps. Schemas are important, but it seems Avro is irrelevant to your aims and is only there because of the code you are calling.
I'm trying to communicate with an OPC DA server and need to write to a tag that is in an array format. We can connect to a simulation server, read tags (int, real, array) and write tags (int, real, str). The problem comes when we need to write to an array tag. The developer of the OpenOPC library (Barry Barnreiter) recommends using a VARIANT variable because OPC "expect to see a Windows VARIANT structure when writing complex objects such as arrays".
I did install Pywin32 (build 217) as suggested here.
I tried to send a simple integer instead of an array in a VARIANT structure.
Here's the code:
from win32com.client import VARIANT
import pythoncom
import OpenOPC
opc_local = OpenOPC.open_client()
opc_local.connect('Matrikon.OPC.Simulation','localhost')
values = VARIANT(pythoncom.VT_ARRAY | pythoncom.VT_R8, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
w = opc_local.write(('Bucket Brigade.ArrayOfReal8', values))
print(w)
Here's the error that we get when the line with opc_local.write gets executed:
AttributeError: 'module' object has no attribute 'VARIANT'
Here's the entire traceback:
runfile('C:/Users/nadmin/Downloads/sanstitre0.py', wdir='C:/Users/nadmin/Downloads')
Traceback (most recent call last):
File "<ipython-input-5-6799f41ab928>", line 1, in <module>
runfile('C:/Users/nadmin/Downloads/sanstitre0.py', wdir='C:/Users/nadmin/Downloads')
File "C:\Users\nadmin\AppData\Local\Continuum\anaconda2\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "C:\Users\nadmin\AppData\Local\Continuum\anaconda2\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 95, in execfile
exec(compile(scripttext, filename, 'exec'), glob, loc)
File "C:/Users/nadmin/Downloads/sanstitre0.py", line 14, in <module>
w = opc_local.write(('Bucket Brigade.ArrayOfReal8', values))
File "C:\Users\nadmin\AppData\Local\Continuum\anaconda2\lib\site-packages\Pyro\core.py", line 381, in __call__
return self.__send(self.__name, args, kwargs)
File "C:\Users\nadmin\AppData\Local\Continuum\anaconda2\lib\site-packages\Pyro\core.py", line 456, in _invokePYRO
return self.adapter.remoteInvocation(name, Pyro.constants.RIF_VarargsAndKeywords, vargs, kargs)
File "C:\Users\nadmin\AppData\Local\Continuum\anaconda2\lib\site-packages\Pyro\protocol.py", line 497, in remoteInvocation
return self._remoteInvocation(method, flags, *args)
File "C:\Users\nadmin\AppData\Local\Continuum\anaconda2\lib\site-packages\Pyro\protocol.py", line 572, in _remoteInvocation
answer.raiseEx()
File "C:\Users\nadmin\AppData\Local\Continuum\anaconda2\lib\site-packages\Pyro\errors.py", line 72, in raiseEx
raise self.excObj
And here's the configuration of the computer:
Windows 10
Python 2.7
Pyro 3.16
Pywin32 Build 223
OpenOPC 1.3.1 win32-py27
You have to change the line opc_local = OpenOPC.open_client() to opc_local = OpenOPC.client(). This makes you connect directly to the OPC server, as opposed to going through the OpenOPC Gateway Service.
The VARIANT structure is not included in the Gateway Service exe. Note that the Gateway Service exe is its own frozen Python distribution, so it only includes the Python modules it needs to run and nothing else. By avoiding the Gateway Service you should not have this problem, since your code will run entirely on the Python distribution that you installed yourself on your PC.
According to Python COM server throws 'module' object has no attribute 'VARIANT', the VARIANT class was introduced in Pywin32 build 217.
As you have included in your post that you have Pywin32 Build 223, this should not be a problem. But to be sure, from this list of available downloads: Home / pywin32 / Build 217, I would specifically select pywin32-217.win-amd64-py2.7.exe.
If that doesn't work, I would suggest double-checking the configuration you listed. Do you have only one version of Python installed? Do you perhaps have multiple Python IDEs that could get mixed up? These are common sources of confusion when fixing bugs.
You need to upgrade Python to 3.9 and Pywin32 to Build 302. In addition, you need to install OpenOPC-Python3x 1.3.1.
My Dataflow pipeline with runtime arguments runs well using the DirectRunner, but it hits an argument error when I switch to the DataflowRunner.
File "/home/user/miniconda3/lib/python3.8/site-packages/apache_beam/options/pipeline_options.py", line 124, in add_value_provider_argument
self.add_argument(*args, **kwargs)
File "/home/user/miniconda3/lib/python3.8/argparse.py", line 1386, in add_argument
return self._add_action(action)
File "/home/user/miniconda3/lib/python3.8/argparse.py", line 1749, in _add_action
self._optionals._add_action(action)
File "/home/user/miniconda3/lib/python3.8/argparse.py", line 1590, in _add_action
action = super(_ArgumentGroup, self)._add_action(action)
File "/home/user/miniconda3/lib/python3.8/argparse.py", line 1400, in _add_action
self._check_conflict(action)
File "/home/user/miniconda3/lib/python3.8/argparse.py", line 1539, in _check_conflict
conflict_handler(action, confl_optionals)
File "/home/user/miniconda3/lib/python3.8/argparse.py", line 1548, in _handle_conflict_error
raise ArgumentError(action, message % conflict_string)
argparse.ArgumentError: argument --bucket_input: conflicting option string: --bucket_input
Here is how the argument is defined and used:
class CustomPipelineOptions(PipelineOptions):
    @classmethod
    def _add_argparse_args(cls, parser):
        parser.add_value_provider_argument(
            '--bucket_input',
            default="device-file-dev",
            help='Raw device file bucket')

pipeline = beam.Pipeline(options=pipeline_options)
custom_options = pipeline_options.view_as(CustomPipelineOptions)
_ = (
    pipeline
    | 'Initiate dataflow' >> beam.Create(["Start"])
    | 'Create P collection with file paths' >> beam.ParDo(
        CreateGcsPCol(input_bucket=custom_options.bucket_input)
    )
)
Note that this only happens with the DataflowRunner. Does anyone know how to solve it? Thanks a lot.
Copying the answer from the comment here:
The error was caused by importing a local Python sub-module via a relative path. With the DirectRunner, the relative path works because everything runs on the local machine. However, the DataflowRunner runs on a different machine (a GCE instance) and needs the absolute path. The problem was solved by installing both the Dataflow pipeline module and the sub-module, and importing from the installed sub-module instead of using the relative path.
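For context on the message itself: argparse raises this ArgumentError whenever the same option string is added twice to one parser. The sketch below reproduces that with the standard library only. The link to the import problem is an assumption, not something confirmed in this thread: if the module that defines CustomPipelineOptions ends up loaded under two different names, its _add_argparse_args would run twice against the same parser. The function register_options is a hypothetical stand-in for that method.

```python
import argparse


def register_options(parser: argparse.ArgumentParser) -> None:
    # Hypothetical stand-in for CustomPipelineOptions._add_argparse_args.
    parser.add_argument(
        "--bucket_input",
        default="device-file-dev",
        help="Raw device file bucket",
    )


parser = argparse.ArgumentParser()
register_options(parser)

try:
    # Registering the same flag a second time triggers the conflict check.
    register_options(parser)
    conflict_message = None
except argparse.ArgumentError as err:
    conflict_message = str(err)

print(conflict_message)
```

If the flag is only ever registered once per parser, the conflict cannot occur, which fits with the fix of importing the sub-module from a single installed location.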
I'm having trouble getting identity-toolkit fully working with the Python App Engine sandbox. The sample provided is for a non-GAE-sandbox project.
The sample project reads gitkit-server-config.json from a file using os.path, but this is not supported in the GAE sandbox. To get around this, I am creating a GitkitClient directly using the constructor:
gitkit_instance = gitkitclient.GitkitClient(
    client_id="123456opg.apps.googleusercontent.com",
    service_account_email="my-project@appspot.gserviceaccount.com",
    service_account_key="/path/to/my-p12file.p12",
    widget_url="http://localhost:8080/callback",
    http=None,
    project_id="my-project")
Is this the correct way to create the GitkitClient?
The issue now is that when I try to do a password reset while running locally with dev_appserver.py, I get the following stack trace:
File "dashboard.py", line 89, in post
oobResult = gitkit_instance.GetOobResult(self.request.POST,self.request.remote_addr)
File "identitytoolkit/gitkitclient.py", line 366, in GetOobResult
param['action'])
File "identitytoolkit/gitkitclient.py", line 435, in _BuildOobLink
code = self.rpc_helper.GetOobCode(param)
File "identitytoolkit/rpchelper.py", line 104, in GetOobCode
response = self._InvokeGitkitApi('getOobConfirmationCode', request)
File "identitytoolkit/rpchelper.py", line 210, in _InvokeGitkitApi
access_token = self._GetAccessToken()
File "identitytoolkit/rpchelper.py", line 231, in _GetAccessToken
'assertion': self._GenerateAssertion(),
File "identitytoolkit/rpchelper.py", line 259, in _GenerateAssertion
crypt.Signer.from_string(self.service_account_key),
File "oauth2client/_pure_python_crypt.py", line 183, in from_string
raise ValueError('No key could be detected.')
ValueError: No key could be detected.
I'm assuming this is a problem with the .p12 file? I double-checked service_account_key="/path/to/my-p12file.p12" and the file exists. What am I missing here?
FYI to others working on this in the future -
I could not get this working in Python. The documentation doesn't make it clear how to get this working in App Engine. In addition, dependency issues with PyCrypto made this a gcc and dependency nightmare.
I was, however, able to get this working in Go, and there is a semi-working example online that will work with some of the modifications highlighted in its issues and pull request pages. Good luck.
I'm running the demo that comes with the mapreduce framework. It's giving me an error:
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/ext/webapp/_webapp25.py", line 703, in __call__
handler.post(*groups)
File "/path/to/mapreduce/base_handler.py", line 68, in post
self.handle()
File "/path/to/mapreduce/handlers.py", line 431, in handle
self.aggregate_state(state, shard_states)
File "/path/to/mapreduce/handlers.py", line 462, in aggregate_state
context.COUNTER_MAPPER_CALLS))
File "/path/to/mapreduce/model.py", line 257, in get
return self.counters.get(counter_name, 0)
AttributeError: 'list' object has no attribute 'get'
Is this something I'm doing wrong, or does the demo not work? Is there more up-to-date code somewhere else?
This is using the code from http://appengine-mapreduce.googlecode.com/svn/trunk/
I'm not familiar with that code, but the latest code is the MapReduce Bundle that you can download from the SDK page:
https://developers.google.com/appengine/downloads
It comes with a bit of a demo. I was able to follow this and get this to work:
http://code.google.com/p/appengine-mapreduce/wiki/GettingStartedInPython
Here are some additional notes I made when I was trying to get MapReduce running:
http://eatdev.tumblr.com/post/17983355135/using-mapreduce-with-django-nonrel-on-app-engine