I am using GCP with its Cloud Functions to execute web scrapers on a frequent basis. Locally, my script works without any problems.
I have a setup.py file in which I initialize the connection to a Kafka producer. It looks like this:
p = Producer(
    {
        "bootstrap.servers": os.environ.get("BOOTSTRAP.SERVERS"),
        "security.protocol": os.environ.get("SECURITY.PROTOCOL"),
        "sasl.mechanisms": os.environ.get("SASL.MECHANISMS"),
        "sasl.username": os.environ.get("SASL.USERNAME"),
        "sasl.password": os.environ.get("SASL.PASSWORD"),
        "session.timeout.ms": os.environ.get("SESSION.TIMEOUT.MS")
    }
)
def delivery_report(err, msg):
    """Called once for each message produced to indicate delivery result.
    Triggered by poll() or flush()."""
    print("Got here!")
    if err is not None:
        print("Message delivery failed: {}".format(err))
    else:
        print("Message delivered to {} [{}]".format(msg.topic(), msg.partition()))
    return "DONE."
I import this setup in main.py, where my scraping functions are defined. It looks similar to this:
from setup import p, delivery_report

def scraper():
    try:
        # I won't insert my whole scraper here since it's working fine ...
        print(scraped_data_as_dict)
        p.produce(topic, json.dumps(scraped_data_as_dict), callback=delivery_report)
        p.poll(0)
    except Exception as e:
        # Do sth else
        pass
The point here is: I am printing my scraped data to the console, but nothing happens with the producer. It's not even logging a failed delivery (delivery_report) to the console. It's like my script is ignoring the produce command. There are also no error reports in the Cloud Function's log. What am I doing wrong, since the function does everything except the important part? What do I have to be aware of when connecting Kafka with Cloud Functions?
I'm testing a script that runs binwalk on a file and then sends a Kafka message to let the sender know whether it completed or failed. It looks like this:
if __name__ == "__main__":
    # finds the path of this file
    scriptpath = os.path.dirname(inspect.getfile(inspect.currentframe()))
    print(scriptpath)
    # sets up kafka consumer on the binwalk topic and kafka producer for the bwsignature topic
    consumer = KafkaConsumer('binwalk', bootstrap_servers=['localhost:9092'])
    producer = KafkaProducer(bootstrap_servers=['localhost:9092'])
    # watches the binwalk kafka topic
    for msg in consumer:
        # load the json
        job = json.loads(msg.value)
        # get the filepath of the .bin
        filepath = job["src"]
        print(0)
        try:
            # runs the script
            binwalkthedog(filepath, scriptpath)
            # send a receipt
            producer.send('bwsignature', b'accepted')
        except:
            producer.send('bwsignature', b'failed')
            pass
    producer.close()
    consumer.close()
If I send in a file that doesn't give any errors in the binwalkthedog function, it works fine, but if I give it a file that doesn't exist, it prints a general error message and moves on to the next input, as it should. For some reason, producer.send('bwsignature', b'failed') doesn't send unless there's something that creates a delay after the binwalkthedog call fails, like time.sleep(1) or a for loop that counts to a million.
Obviously I could keep that in place but it's really gross and I'm sure there's a better way to do this.
This is the temp script I'm using to send and receive a signal from the binwalkthedog module:
job = {
    'src': '/home/nick/Documents/summer-2021-intern-project/BinwalkModule/bo.bin',
    'id': 1
}
chomp = json.dumps(job).encode('ascii')
receipt = KafkaConsumer('bwsignature', bootstrap_servers=['localhost:9092'])
producer = KafkaProducer(bootstrap_servers=['localhost:9092'])
future = producer.send('binwalk', chomp)
try:
    record_metadata = future.get(timeout=10)
except KafkaError:
    print("sucks")
    pass
print(record_metadata.topic)
print(record_metadata.partition)
print(record_metadata.offset)
producer.close()
for msg in receipt:
    print(msg.value)
    break
Kafka producers batch many records together to reduce requests made to the server. If you want to force records to send, rather than introducing a blocking sleep call or calling get on the future, you should use producer.flush().
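For illustration, here is roughly how the loop from the question could look with a flush after each message (binwalkthedog, the topic names, and the bare except are taken from the question's code):

for msg in consumer:
    job = json.loads(msg.value)
    filepath = job["src"]
    try:
        binwalkthedog(filepath, scriptpath)
        producer.send('bwsignature', b'accepted')
    except:
        producer.send('bwsignature', b'failed')
    # Block until every buffered record has actually been handed to the broker,
    # instead of relying on an artificial delay before the loop continues.
    producer.flush()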
I use the kafka-python==2.0.0 library.
With the piece of code below, if I do not receive a message for 1 hour, the next messages pushed to the Kafka topic are not processed by the consumer; however, the loop does not stop.
I would like my listener to run around the clock without losing the connection.
consumer = KafkaConsumer(
    os.environ.get('MY_TOPIC'),
    bootstrap_servers=broker,
    api_version=my_version,
    security_protocol='SASL_PLAINTEXT',
    sasl_mechanism='GSSAPI',
    sasl_kerberos_service_name=service_name,
    group_id='MY_GRP_ID',
    max_poll_records=1
)
try:
    for msg in consumer:
        ##PROCESS function ...
        consumer.commit()
finally:
    consumer.close()
I finally used the poll method:
from kafka import KafkaConsumer

# To consume latest messages and auto-commit offsets
consumer = KafkaConsumer('my-topic',
                         group_id='my-group',
                         bootstrap_servers=['localhost:9092'])

while True:
    # Response format is {TopicPartition('topic1', 1): [msg1, msg2]}
    msg_pack = consumer.poll(timeout_ms=500)
    for tp, messages in msg_pack.items():
        for message in messages:
            # message value and key are raw bytes -- decode if necessary!
            # e.g., for unicode: `message.value.decode('utf-8')`
            print("%s:%d:%d: key=%s value=%s" % (tp.topic, tp.partition,
                                                 message.offset, message.key,
                                                 message.value))
The advantage of this syntax, from my point of view, is better visibility into how messages are retrieved. It is not in my example, but I could also manage stopping the program more cleanly by watching for a SIGTERM signal.
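For instance, a rough sketch of what I mean (the handler and flag names are just illustrative, built around the same poll loop as above):

import signal
from kafka import KafkaConsumer

running = True

def handle_sigterm(signum, frame):
    # Flip the flag so the poll loop below exits cleanly on SIGTERM.
    global running
    running = False

signal.signal(signal.SIGTERM, handle_sigterm)

consumer = KafkaConsumer('my-topic',
                         group_id='my-group',
                         bootstrap_servers=['localhost:9092'])
while running:
    msg_pack = consumer.poll(timeout_ms=500)
    for tp, messages in msg_pack.items():
        for message in messages:
            print("%s:%d:%d: key=%s value=%s" % (tp.topic, tp.partition,
                                                 message.offset, message.key,
                                                 message.value))
consumer.close()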
What is the easiest way to create a delay (or parking) queue with Python, Pika and RabbitMQ? I have seen similar questions, but none for Python.
I find this a useful idea when designing applications, as it allows us to throttle messages that need to be re-queued.
There is always the possibility that you will receive more messages than you can handle; maybe the HTTP server is slow, or the database is under too much stress.
I also found it very useful in scenarios where there is zero tolerance for losing messages. While re-queuing messages that could not be handled may solve that, it can also cause problems where the message is queued over and over again, potentially causing performance issues and log spam.
I found this extremely useful when developing my applications, as it gives you an alternative to simply re-queuing your messages. This can easily reduce the complexity of your code, and it is one of many powerful hidden features in RabbitMQ.
Steps
First we need to set up two basic channels, one for the main queue and one for the delay queue. In my example at the end, I include a couple of additional flags that are not required but make the code more reliable, such as confirm_delivery, delivery_mode and durable. You can find more information on these in the RabbitMQ manual.
After we have set up the channels, we add a binding to the main queue that we can use to send messages from the delay channel to our main queue.
channel.queue_bind(exchange='amq.direct',
                   queue='hello')
Next we need to configure our delay channel to forward messages to the main queue once they have expired.
delay_channel.queue_declare(queue='hello_delay', durable=True, arguments={
    'x-message-ttl': 5000,
    'x-dead-letter-exchange': 'amq.direct',
    'x-dead-letter-routing-key': 'hello'
})
x-message-ttl (Message - Time To Live)
This is normally used to automatically remove old messages in the queue after a specific duration, but by adding two optional arguments we can change this behaviour, and instead have this parameter determine how long, in milliseconds, messages will stay in the delay queue.
x-dead-letter-routing-key
This variable allows us to transfer the message to a different queue once it has expired, instead of the default behaviour of removing it completely.
x-dead-letter-exchange
This variable determines which exchange is used to transfer the message from the hello_delay queue to the hello queue.
Publishing to the delay queue
When we are done setting up all the basic Pika parameters, you simply send a message to the delay queue using basic_publish.
delay_channel.basic_publish(exchange='',
                            routing_key='hello_delay',
                            body="test",
                            properties=pika.BasicProperties(delivery_mode=2))
Once you have executed the script you should see the following queues created in your RabbitMQ management module.
Example.
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(
    'localhost'))

# Create normal 'Hello World' type channel.
channel = connection.channel()
channel.confirm_delivery()
channel.queue_declare(queue='hello', durable=True)

# We need to bind this channel to an exchange, that will be used to transfer
# messages from our delay queue.
channel.queue_bind(exchange='amq.direct',
                   queue='hello')

# Create our delay channel.
delay_channel = connection.channel()
delay_channel.confirm_delivery()

# This is where we declare the delay, and routing for our delay channel.
delay_channel.queue_declare(queue='hello_delay', durable=True, arguments={
    'x-message-ttl': 5000,  # Delay until the message is transferred in milliseconds.
    'x-dead-letter-exchange': 'amq.direct',  # Exchange used to transfer the message from A to B.
    'x-dead-letter-routing-key': 'hello'  # Name of the queue we want the message transferred to.
})

delay_channel.basic_publish(exchange='',
                            routing_key='hello_delay',
                            body="test",
                            properties=pika.BasicProperties(delivery_mode=2))

print(" [x] Sent")
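To see the delay in action, a minimal consumer sketch for the hello queue could look like this (assuming a recent pika, 1.x, where basic_consume takes on_message_callback; the callback name is just an example):

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='hello', durable=True)

def on_message(ch, method, properties, body):
    # Messages show up here roughly 5000 ms after being published to hello_delay.
    print(" [x] Received %r" % body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue='hello', on_message_callback=on_message)
channel.start_consuming()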
You can use the official RabbitMQ plugin: x-delayed-message.
First, download and copy the .ez file into Your_rabbitmq_root_path/plugins
Second, enable the plugin (no need to restart the server):
rabbitmq-plugins enable rabbitmq_delayed_message_exchange
Finally, publish your message with an "x-delay" header, like:
headers.put("x-delay", 5000);
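That snippet is Java; with pika in Python the same idea might look roughly like this (a sketch, assuming the plugin is enabled and a recent pika; the exchange and queue names are made up):

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Declare a delayed-message exchange; 'x-delayed-type' sets how it routes
# once the delay has elapsed (here it behaves like a direct exchange).
channel.exchange_declare(exchange='delayed_exchange',
                         exchange_type='x-delayed-message',
                         arguments={'x-delayed-type': 'direct'})
channel.queue_declare(queue='hello', durable=True)
channel.queue_bind(exchange='delayed_exchange', queue='hello', routing_key='hello')

# The per-message 'x-delay' header (in milliseconds) controls the delay.
channel.basic_publish(exchange='delayed_exchange',
                      routing_key='hello',
                      body='test',
                      properties=pika.BasicProperties(headers={'x-delay': 5000}))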
Notice:
It does not ensure your message's safety: if your message expires during your rabbitmq-server's downtime, the message is unfortunately lost. So be careful when you use this scheme.
Enjoy it, and find more info at rabbitmq-delayed-message-exchange.
FYI, here is how to do this in Spring 3.2.x.
<rabbit:queue name="delayQueue" durable="true" queue-arguments="delayQueueArguments"/>

<rabbit:queue-arguments id="delayQueueArguments">
    <entry key="x-message-ttl">
        <value type="java.lang.Long">10000</value>
    </entry>
    <entry key="x-dead-letter-exchange" value="finalDestinationTopic"/>
    <entry key="x-dead-letter-routing-key" value="finalDestinationQueue"/>
</rabbit:queue-arguments>

<rabbit:fanout-exchange name="finalDestinationTopic">
    <rabbit:bindings>
        <rabbit:binding queue="finalDestinationQueue"/>
    </rabbit:bindings>
</rabbit:fanout-exchange>
NodeJS implementation.
Everything is pretty clear from the code.
Hope it saves somebody some time.
var ch = channel;

ch.assertExchange("my_intermediate_exchange", 'fanout', {durable: false});
ch.assertExchange("my_final_delayed_exchange", 'fanout', {durable: false});

// setup intermediate queue which will never be listened.
// all messages are TTLed so when they are "dead", they come to another exchange
ch.assertQueue("my_intermediate_queue", {
    deadLetterExchange: "my_final_delayed_exchange",
    messageTtl: 5000, // 5sec
}, function (err, q) {
    ch.bindQueue(q.queue, "my_intermediate_exchange", '');
});

ch.assertQueue("my_final_delayed_queue", {}, function (err, q) {
    ch.bindQueue(q.queue, "my_final_delayed_exchange", '');

    ch.consume(q.queue, function (msg) {
        console.log("delayed - [x] %s", msg.content.toString());
    }, {noAck: true});
});
A message in a RabbitMQ queue can be delayed in 2 ways:
- using a queue TTL
- using a message TTL
If all messages in the queue are to be delayed for a fixed time, use a queue TTL.
If each message has to be delayed by a varying time, use a message TTL.
I have explained it using Python 3 and the pika module.
The pika BasicProperties argument 'expiration', in milliseconds, has to be set to delay a message in the delay queue.
After setting the expiration time, publish the message to a delayed_queue (not the actual queue where consumers are waiting to consume); once the message in the delayed_queue expires, it will be routed to the actual queue using the 'amq.direct' exchange.
def delay_publish(self, messages, queue, headers=None, expiration=0):
    """
    Connect to RabbitMQ and publish messages to the queue
    Args:
        queue (string): queue name
        messages (list or single item): messages to publish to rabbit queue
        expiration(int): TTL in milliseconds for message
    """
    delay_queue = "".join([queue, "_delay"])
    logging.info('Publishing To Queue: {queue}'.format(queue=delay_queue))
    logging.info('Connecting to RabbitMQ: {host}'.format(
        host=self.rabbit_host))
    credentials = pika.PlainCredentials(
        RABBIT_MQ_USER, RABBIT_MQ_PASS)
    parameters = pika.ConnectionParameters(
        rabbit_host, RABBIT_MQ_PORT,
        RABBIT_MQ_VHOST, credentials, heartbeat_interval=0)
    connection = pika.BlockingConnection(parameters)
    channel = connection.channel()
    channel.queue_declare(queue=queue, durable=True)
    channel.queue_bind(exchange='amq.direct',
                       queue=queue)
    delay_channel = connection.channel()
    delay_channel.queue_declare(queue=delay_queue, durable=True,
                                arguments={
                                    'x-dead-letter-exchange': 'amq.direct',
                                    'x-dead-letter-routing-key': queue
                                })
    properties = pika.BasicProperties(
        delivery_mode=2, headers=headers, expiration=str(expiration))
    if type(messages) not in (list, tuple):
        messages = [messages]
    try:
        for message in messages:
            try:
                json_data = json.dumps(message)
            except Exception as err:
                logging.error(
                    'Error Jsonify Payload: {err}, {payload}'.format(
                        err=err, payload=repr(message)), exc_info=True
                )
                if (type(message) is dict) and ('data' in message):
                    message['data'] = {}
                    message['error'] = 'Payload Invalid For JSON'
                    json_data = json.dumps(message)
                else:
                    raise
            try:
                delay_channel.basic_publish(
                    exchange='', routing_key=delay_queue,
                    body=json_data, properties=properties)
            except Exception as err:
                logging.error(
                    'Error Publishing Data: {err}, {payload}'.format(
                        err=err, payload=json_data), exc_info=True
                )
                raise
    except Exception:
        raise
    finally:
        logging.info(
            'Done Publishing. Closing Connection to {queue}'.format(
                queue=delay_queue
            )
        )
        connection.close()
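Hypothetical usage (publisher stands for whatever object delay_publish is defined on), where each message carries its own delay:

# 'publisher' is a placeholder for the object this method lives on.
# Each call sets a different per-message TTL, so the messages reach the
# real 'hello' queue after different delays.
publisher.delay_publish(messages={'job': 'retry-soon'}, queue='hello', expiration=5000)
publisher.delay_publish(messages={'job': 'retry-later'}, queue='hello', expiration=60000)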
Depending on your scenario and needs, I would recommend the following approaches:
Using the official plugin, https://www.rabbitmq.com/blog/2015/04/16/scheduling-messages-with-rabbitmq/. However, it has a capacity issue if the total count of delayed messages exceeds a certain number (https://github.com/rabbitmq/rabbitmq-delayed-message-exchange/issues/72), it has no high-availability option, and it will lose data if a message's delay elapses during an MQ restart.
Implement a set of cascading delayed queues just like NServiceBus did (https://docs.particular.net/transports/rabbitmq/delayed-delivery).
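As a rough sketch of the second approach (the queue names and TTL tiers here are arbitrary), each tier is just an ordinary queue that dead-letters into the real one:

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()
channel.queue_declare(queue='work', durable=True)
channel.queue_bind(exchange='amq.direct', queue='work')

# One delay queue per tier; publish to the tier closest to the delay you
# want, and the message dead-letters into 'work' when its TTL expires.
for name, ttl_ms in [('work_delay_5s', 5000),
                     ('work_delay_1m', 60000),
                     ('work_delay_10m', 600000)]:
    channel.queue_declare(queue=name, durable=True, arguments={
        'x-message-ttl': ttl_ms,
        'x-dead-letter-exchange': 'amq.direct',
        'x-dead-letter-routing-key': 'work',
    })

# Delay this message by roughly one minute.
channel.basic_publish(exchange='', routing_key='work_delay_1m', body='test',
                      properties=pika.BasicProperties(delivery_mode=2))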
I've just gotten started using pika (v0.9.4) with Tornado (through pika.adapters.tornado_connection.TornadoConnection), and I was wondering what the appropriate way is to catch errors when using, say, queue_delete when the queue you're trying to delete doesn't exist. RabbitMQ raises AMQPError, but I am not sure how this can be handled in an async way.
Does anyone have any insights on this?
Disclaimer: I'm the author of stormed-amqp
I would suggest trying with stormed-amqp
import logging
logging.basicConfig()

from tornado.ioloop import IOLoop
from stormed import Connection

def on_connect():
    ch = conn.channel()
    ch.queue_declare(queue='hello', durable=False)
    ch.queue_declare(queue='hello', durable=True)

def on_error(e):
    print("Got Connection error", e.reply_text, e.reply_code)
    io_loop.stop()

conn = Connection(host='localhost')
conn.on_error = on_error
conn.connect(on_connect)

io_loop = IOLoop.instance()
io_loop.start()
Try to avoid the error. If you declare the queue and it doesn't exist, it will be created; then you can immediately delete it.
Or, if you will use that queue again in the next week or so, i.e. it is not single-use, then just leave it around and handle deletion as a system-admin activity that cleans up long-term idle queues.
Or just declare your queues with the auto-delete attribute and they will go away when you disconnect.
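For the auto-delete option, a small pika sketch (the queue name is just an example):

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# Declaring creates the queue if it does not exist; with auto_delete=True,
# RabbitMQ removes it once the last consumer disconnects, so there is no
# explicit queue_delete call that could fail.
channel.queue_declare(queue='hello', auto_delete=True)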