Confluent Python API for Kafka

I'm getting an error with basic usage of the official Confluent Kafka Python API:
I subscribe:
kafka_consumer.subscribe(topics=["my-avro-topic"], on_assign=on_assign_callback, on_revoke=on_revoke_callback)
Using callback:
def on_assign_callback(consumer, topic_partitions):
    for topic_partition in topic_partitions:
        print("without position. topic={}. partition={}. offset={}. error={}".format(
            topic_partition.topic, topic_partition.partition,
            topic_partition.offset, topic_partition.error))
    topic_partitions_with_offsets = consumer.position(topic_partitions)
    print("assigned to {}->{} partitions".format(len(topic_partitions), len(topic_partitions_with_offsets)))
    for topic_partition in topic_partitions_with_offsets:
        print("with position. topic={}. partition={}. offset={}. error={}".format(
            topic_partition.topic, topic_partition.partition,
            topic_partition.offset, topic_partition.error))
which produces the console output:
without position. topic=my-avro-topic. partition=0. offset=-1001. error=None
assigned to 1->1 partitions
with position. topic=my-avro-topic. partition=0. offset=-1001. error=KafkaError{code=_UNKNOWN_PARTITION,val=-190,str="(null)"}
Can someone explain this? Why would I get a callback notification on an unknown partition? Similar code works perfectly using the Java API.

This is a bug in the underlying C library, librdkafka; see the upstream issue.
If you want to start consuming from the stored (committed) offsets, you don't actually need to call position() to retrieve them: the client resumes from them automatically as long as you leave the assigned partitions at the default offset of -1001 (OFFSET_INVALID).
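To illustrate, here is a minimal sketch of subscribing without touching the offsets in on_assign, so the client resumes from the committed offsets on its own (broker address, group id, and topic below are placeholders, not from the original question):

from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",  # placeholder broker
    "group.id": "my-group",                 # placeholder group id
    "auto.offset.reset": "earliest",
})

def on_assign_callback(consumer, topic_partitions):
    # Leave each TopicPartition at the default offset (-1001, OFFSET_INVALID);
    # the client then resumes from the committed offsets automatically.
    for tp in topic_partitions:
        print("assigned: topic={} partition={}".format(tp.topic, tp.partition))

consumer.subscribe(["my-avro-topic"], on_assign=on_assign_callback)

while True:
    msg = consumer.poll(1.0)
    if msg is None:
        continue
    if msg.error():
        print("consume error: {}".format(msg.error()))
        continue
    print("consumed partition={} offset={}".format(msg.partition(), msg.offset()))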

Related

How to connect kafka IO from apache beam to a cluster in confluent cloud

I've made a simple pipeline in Python to read from Kafka. The Kafka cluster is on Confluent Cloud and I'm having some trouble connecting to it.
I'm getting the following log on the Dataflow job:
Caused by: org.apache.kafka.common.KafkaException: Failed to construct kafka consumer
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:820)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:631)
at org.apache.kafka.clients.consumer.KafkaConsumer.<init>(KafkaConsumer.java:612)
at org.apache.beam.sdk.io.kafka.KafkaIO$Read$GenerateKafkaSourceDescriptor.processElement(KafkaIO.java:1495)
Caused by: java.lang.IllegalArgumentException: Could not find a 'KafkaClient' entry in the JAAS configuration. System property 'java.security.auth.login.config' is not set
So I think I'm missing something while passing the config, since the error mentions something related to it. I'm really new to all of this and I know nothing about Java, so even after reading the JAAS documentation I don't know how to proceed.
The code of the pipeline is the following:
import apache_beam as beam
from apache_beam.io.kafka import ReadFromKafka
from apache_beam.options.pipeline_options import PipelineOptions
import os
import json
import logging

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'credentialsOld.json'

with open('cluster.configuration.json') as cluster:
    data = json.load(cluster)
    cluster.close()

def logger(element):
    logging.info('Something was found')

def main():
    config = {
        "bootstrap.servers": data["bootstrap.servers"],
        "security.protocol": data["security.protocol"],
        "sasl.mechanisms": data["sasl.mechanisms"],
        "sasl.username": data["sasl.username"],
        "sasl.password": data["sasl.password"],
        "session.timeout.ms": data["session.timeout.ms"],
        "auto.offset.reset": "earliest"
    }
    print('======================================================')
    beam_options = PipelineOptions(
        runner='DataflowRunner',
        project='project',
        experiments=['use_runner_v2'],
        streaming=True,
        save_main_session=True,
        job_name='kafka-stream-test')
    with beam.Pipeline(options=beam_options) as p:
        msgs = p | 'ReadKafka' >> ReadFromKafka(
            consumer_config=config,
            topics=['users'],
            expansion_service="localhost:8088")
        msgs | beam.FlatMap(logger)

if __name__ == '__main__':
    main()
I read something about passing a java.security.auth.login.config property in the config dictionary, but since that example is in Java and I'm using Python, I'm really lost about what I have to pass, or even whether that's the right property to pass.
By the way, I'm getting the API key and secret from here, and that is what I'm passing to sasl.username and sasl.password.
I faced the same error the first time I tried Beam's expansion service. The key sasl.mechanisms that you are supplying is incorrect; try sasl.mechanism instead. You also do not need to supply the username and password separately, since the connection is authenticated via JAAS. Basically, a consumer_config like the one below worked for me:
config = {
    "bootstrap.servers": data["bootstrap.servers"],
    "security.protocol": data["security.protocol"],
    "sasl.mechanism": data["sasl.mechanisms"],
    "session.timeout.ms": data["session.timeout.ms"],
    "group.id": "tto",
    "sasl.jaas.config": f'org.apache.kafka.common.security.plain.PlainLoginModule required serviceName="Kafka" username=\"{data["sasl.username"]}\" password=\"{data["sasl.password"]}\";',
    "auto.offset.reset": "earliest"
}
I got a partial answer to this question: I fixed this problem but ran into another one.
config = {
    "bootstrap.servers": data["bootstrap.servers"],
    "security.protocol": data["security.protocol"],
    "sasl.mechanisms": data["sasl.mechanisms"],
    "sasl.username": data["sasl.username"],
    "sasl.password": data["sasl.password"],
    "session.timeout.ms": data["session.timeout.ms"],
    "group.id": "tto",
    "sasl.jaas.config": f'org.apache.kafka.common.security.plain.PlainLoginModule required serviceName="Kafka" username=\"{data["sasl.username"]}\" password=\"{data["sasl.password"]}\";',
    "auto.offset.reset": "earliest"
}
I needed to provide the sasl.jaas.config property with the API key and secret of my cluster and also the service name. However, now I'm facing a different error when running the pipeline on Dataflow:
Caused by: org.apache.kafka.common.errors.TimeoutException: Timeout expired while fetching topic metadata
This error shows up after 4-5 minutes of trying to run the job on Dataflow. I have no idea how to fix it, but I think my broker on Confluent is rejecting the connection; this could be related to the execution zone, since the cluster is in a different zone than the job region.
UPDATE:
I tested the code on Linux/Ubuntu and, I don't know why, but the expansion service gets downloaded automatically, so you won't get an unsupported signal error. I'm still having some issues trying to authenticate to Confluent Kafka, though.

Flyte 0.16.2: Error loading Blob - How to get Types.Blob.fetch() to work in task decorated function?

I have a Flyte task function like this:
@task
def do_stuff(framework_obj):
    framework_obj.get_outputs()  # This calls Types.Blob.fetch(some_uri)
Trying to load a blob URI using flytekit.sdk.types.Types.Blob.fetch, but getting this error:
ERROR:flytekit: Exception when executing No temporary file system is present. Either call this method from within the context of a task or surround with a 'with LocalTestFileSystem():' block. Or specify a path when calling this function. Note: Cleanup is not automatic when a path is specified.
I can confirm I can load blobs inside a with LocalTestFileSystem(): block in tests, but when actually trying to run a workflow I'm not sure why I'm getting this error, as the function that does the blob processing is decorated with @task, so it's definitely a Flyte task. I also confirmed that the task node exists on the Flyte web console.
What path is the error referencing and how do I call this function appropriately?
Using Flyte Version 0.16.2
Could you please give a bit more information about the code? Is this flytekit version 0.15.x? I'm a bit confused, since that version shouldn't have the @task decorator; it should only have @python_task, which is an older API. If you want to use the new Python native typing API, you should install flytekit==0.17.0 instead.
Also, could you point to the documentation you're looking at? We've updated the docs a fair amount recently, so maybe there's some confusion around that. These are the examples worth looking at. There are also two new Python classes, FlyteFile and FlyteDirectory, that have replaced the Blob class in flytekit (though Blob remains the name of the IDL type).
(would've left this as a comment but I don't have the reputation to yet.)
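As a rough illustration of the newer native-typing API (a minimal sketch assuming flytekit>=0.17.0; the task names and file path are placeholders, not from the original question):

from flytekit import task, workflow
from flytekit.types.file import FlyteFile

@task
def write_report() -> FlyteFile:
    # Write a local file and return it; flytekit uploads it to blob storage
    # and hands downstream tasks a FlyteFile they can open or download.
    path = "/tmp/report.txt"  # placeholder path
    with open(path, "w") as f:
        f.write("hello")
    return FlyteFile(path)

@task
def read_report(ff: FlyteFile) -> str:
    # Opening the FlyteFile inside a task downloads it automatically.
    with open(ff, "r") as f:
        return f.read()

@workflow
def wf() -> str:
    return read_report(ff=write_report())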
Some code to help with fetching outputs and reading from a file output
# imports assumed for flytekit 0.17.x; module paths may differ in other versions
from flytekit import task
from flytekit.clients.friendly import SynchronousFlyteClient
from flytekit.core.context_manager import FlyteContext
from flytekit.core.type_engine import TypeEngine
from flytekit.models.core.identifier import WorkflowExecutionIdentifier
from flytekit.types.file import FlyteFile

@task
def task_file_reader():
    client = SynchronousFlyteClient("flyteadmin.flyte.svc.cluster.local:81", insecure=True)
    exec_id = WorkflowExecutionIdentifier(
        domain="development",
        project="flytesnacks",
        name="iaok0qy6k1",
    )
    data = client.get_execution_data(exec_id)
    lit = data.full_outputs.literals["o0"]
    ctx = FlyteContext.current_context()
    ff = TypeEngine.to_python_value(ctx, lv=lit, expected_python_type=FlyteFile)
    with open(ff, 'rb') as fh:
        print(fh.readlines())

Deprecated method when making C2D communication from Azure IoT Hub to a Python script

message = client.receive_message()
This code is now deprecated, and when searching for a solution it seems I am the only one with this issue.
I get this warning:
DeprecatedWarning: receive_message is deprecated as of 2.3.0. We
recommend that you use the .on_message_received property to set a
handler instead of message = client.receive_message()
If you have a possible solution, please post it here.
I am running the latest Python 3.9 and the latest Azure IoT device library.
You're trying to use a method that's no longer supported (because it's deprecated). As the error message says, the correct way to handle C2D messages is to use an event handler. There is a good example of that here.
The part you will be interested in is:
# define behavior for receiving a message
# NOTE: this could be a function or a coroutine
def message_received_handler(message):
    print("the data in the message received was ")
    print(message.data)
    print("custom properties are")
    print(message.custom_properties)
    print("content Type: {0}".format(message.content_type))
    print("")

# set the message received handler on the client
device_client.on_message_received = message_received_handler
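For context, here is a rough sketch of wiring that handler to a device client (a minimal sketch assuming the synchronous azure-iot-device 2.x client; the connection string is a placeholder):

import time
from azure.iot.device import IoTHubDeviceClient

# placeholder connection string
conn_str = "HostName=<hub>.azure-devices.net;DeviceId=<device>;SharedAccessKey=<key>"
device_client = IoTHubDeviceClient.create_from_connection_string(conn_str)
device_client.connect()

device_client.on_message_received = message_received_handler

# keep the process alive so C2D messages can arrive
while True:
    time.sleep(1)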

Aerospike Python Client. UDF module to count records. Cannot register module

I am currently implementing the Aerospike Python Client in order to benchmark it along with our Redis implementation, to see which is faster and/or more stable.
I'm still taking baby steps, currently unit-testing basic functionality, for example whether I correctly add records to my set. For that reason, I want to create a function to count them.
I saw in Aerospike's documentation that "to perform an aggregation on query, you first need to register a UDF with the database".
It seems that this is the suggested way that aggregations, counting and other custom functionality should be run in Aerospike.
Therefore, to count the records in a set I have, I created the following module:
# "counter.lua"
function count(s)
return s : map(function() return 1 end) : reduce (function(a,b) return a+b end)
end
I'm trying to use the Aerospike Python client's function for registering a UDF (User Defined Function) module:
udf_put(filename, udf_type, policy)
My code is as follows:
# aerospike_client.py:
# "udf_put" parameters
policy = {'timeout': 1000}
lua_module = os.path.join(os.path.dirname(os.path.realpath(__file__)), "counter.lua") #same folder
udf_type = aerospike.UDF_TYPE_LUA # equals to "0", which is for "Lua"
self.client.udf_put(lua_module, udf_type, policy) # Exception is thrown here
query = self.client.query(self.aero_namespace, self.aero_set)
query.select()
result = query.apply('counter', 'count')
An exception is thrown:
exceptions.Exception: (-2L, 'Filename should be a string', 'src/main/client/udf.c', 82)
Is there anything I'm missing or doing wrong?
Is there a way to "debug" it without compiling C code?
Is there any other suggested way to count the records in my set? Or I'm fine with the Lua module?
First, I'm not seeing that exception, but I am seeing a bug with udf_put where the module is registered but the python process hangs. I can see the module appear on the server using AQL's show modules.
I opened a bug with the Python client's repo on Github, aerospike/aerospike-client-python.
There's a best practices document regarding UDF development here: https://www.aerospike.com/docs/udf/best_practices.html
In general using the stream-UDF to aggregate the records through the count function is the correct way to go about it. There are examples here and here.
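As a rough end-to-end sketch (a minimal example assuming a local server; the host, namespace, and set name are placeholders), registration and aggregation with the Python client can look like this:

import os
import aerospike

client = aerospike.client({'hosts': [('127.0.0.1', 3000)]}).connect()

# Register the stream UDF module (the path must be passed as a string).
lua_module = os.path.join(os.path.dirname(os.path.realpath(__file__)), "counter.lua")
client.udf_put(lua_module, aerospike.UDF_TYPE_LUA, {'timeout': 1000})

# Apply the stream UDF to a query and collect the aggregated result.
query = client.query('test', 'demo')  # placeholder namespace / set
query.apply('counter', 'count')
results = query.results()             # e.g. a single-element list with the record count
print(results)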

Using mx:RemoteObject with web2py's @service.amfrpc decorator

I am using web2py (v1.63) and Flex 3. web2py v1.61 introduced the @service decorators, which allow you to tag a controller function with @service.amfrpc. You can then call that function remotely using http://..../app/default/call/amfrpc/[function]. See http://www.web2py.com/examples/default/tools#services. Does anybody have an example of how you would set up Flex 3 to call a function like this? Here is what I have tried so far:
<mx:RemoteObject id="myRemote" destination="amfrpc" source="amfrpc"
endpoint="http://{mysite}/{myapp}/default/call/amfrpc/">
<mx:method name="getContacts"
result="show_results(event)"
fault="on_fault(event)" />
</mx:RemoteObject>
In my scenario, what should be the value of the destination and source attributes? I have read a couple of articles on non-web2py implementations, such as http://corlan.org/2008/10/10/flex-and-php-remoting-with-amfphp/, but they use a .../gateway.php file instead of having a URI that maps directly to the function.
Alternatively, I have been able to use flash.net.NetConnection to successfully call my remote function, but most of the documentation I have found considers this to be the old, pre-Flex 3 way of doing AMF. See http://pyamf.org/wiki/HelloWorld/Flex. Here is the NetConnection code:
gateway = new NetConnection();
gateway.connect("http://{mysite}/{myapp}/default/call/amfrpc/");
resp = new Responder(show_results, on_fault);
gateway.call("getContacts", resp);
-Rob
I have not found a way to use a RemoteObject with the @service.amfrpc decorator. However, I can use the older ActionScript code with a NetConnection (similar to what I posted originally) and pair that with a @service.amfrpc function on the web2py side. This seems to work fine. The one thing you would want to change in the NetConnection code I shared originally is adding an event listener for connection status. You can add more listeners if you feel the need, but I found that NetStatusEvent was a must; this event fires if the server is not responding. Your connection setup would look like:
gateway = new NetConnection();
gateway.addEventListener(NetStatusEvent.NET_STATUS, gateway_status);
gateway.connect("http://127.0.0.1:8000/robs_amf/default/call/amfrpc/");
resp = new Responder(show_results, on_fault);
gateway.call("getContacts", resp);
-Rob
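For completeness, a rough sketch of the web2py side (a minimal sketch following the standard web2py Service pattern of that era; the function body and db table are placeholders, not from the original posts):

# in controllers/default.py
from gluon.tools import Service
service = Service(globals())

@service.amfrpc
def getContacts():
    # placeholder query; return something serializable over AMF
    return db(db.contact).select().as_list()

def call():
    # exposes http://{mysite}/{myapp}/default/call/amfrpc/getContacts
    session.forget()
    return service()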
