Heartbeat session expired, marking coordinator dead - python

I am using the kafka-python (1.4.4) library to consume messages (Kafka version 1.1.0, Python 3.7). It throws this error again and again, and I don't know where it is going wrong. Here is my Python code. The consumer initialization:
consumer = KafkaConsumer('dolphin-spider-google-book-bookinfo',
                         bootstrap_servers=['mq-server:9092'],
                         group_id="google-book",
                         client_id="dolphin-pipline-google-bookinfo-consumer-foolman",
                         # Manage Kafka offsets manually
                         enable_auto_commit=False,
                         consumer_timeout_ms=50000,
                         # Consume from the beginning
                         auto_offset_reset="earliest",
                         max_poll_interval_ms=350000,
                         session_timeout_ms=60000,
                         request_timeout_ms=700000
                         )
Here is the consume logic:
def consume_bookinfo(self):
    while True:
        try:
            for books in self.consumer:
                logger.info("Get books info offset: %s", books.offset)
                self.sub_process_handle(books.value, books.offset)
        except Exception as e:
            logger.error(e)

def sub_process_handle(self, bookinfo, offset):
    number_of_threadings = len(threading.enumerate())
    if number_of_threadings < 13:
        t = threading.Thread(target=self.background_process, name="offset-" + str(offset), args=(bookinfo,), kwargs={})
        t.start()
    else:
        # If all worker threads are running,
        # use the main thread to handle the message,
        # which slows down the Kafka consume speed
        logger.info("Reach max handle thread,sleep 20s to wait thread release...")
        time.sleep(20)
        self.sub_process_handle(bookinfo, offset)

def background_process(self, bookinfo):
    self.parse_bookinfo(bookinfo)
    self.consumer.commit_async(callback=self.offset_commit_result)
I start multiple threads to handle the consume logic, but after running for some time it throws this error:
2019-01-30 02:46:52,948 - /home/dolphin/source/dolphin-pipline/dolphin/biz/spider_bookinfo_consumer.py[line:37] - INFO: Get books info offset: 9304
2019-01-30 02:46:52,948 - /home/dolphin/source/dolphin-pipline/dolphin/biz/spider_bookinfo_consumer.py[line:51] - INFO: Reach max handle thread,sleep 20s to wait thread release...
2019-01-30 02:47:12,968 - /home/dolphin/source/dolphin-pipline/dolphin/biz/spider_bookinfo_consumer.py[line:61] - INFO: commit offset success,offsets: {TopicPartition(topic='dolphin-spider-google-book-bookinfo', partition=0): OffsetAndMetadata(offset=9305, metadata='')}
2019-01-30 04:27:47,322 - /usr/local/lib/python3.5/site-packages/kafka/coordinator/base.py[line:964] - WARNING: Heartbeat session expired, marking coordinator dead
2019-01-30 04:27:47,323 - /usr/local/lib/python3.5/site-packages/kafka/coordinator/base.py[line:698] - WARNING: Marking the coordinator dead (node 0) for group google-book: Heartbeat session expired.
2019-01-30 04:27:47,433 - /usr/local/lib/python3.5/site-packages/kafka/cluster.py[line:353] - INFO: Group coordinator for google-book is BrokerMetadata(nodeId=0, host='35.229.69.193', port=9092, rack=None)
2019-01-30 04:27:47,433 - /usr/local/lib/python3.5/site-packages/kafka/coordinator/base.py[line:676] - INFO: Discovered coordinator 0 for group google-book
2019-01-30 04:27:47,433 - /usr/local/lib/python3.5/site-packages/kafka/coordinator/consumer.py[line:341] - INFO: Revoking previously assigned partitions {TopicPartition(topic='dolphin-spider-google-book-bookinfo', partition=0)} for group google-book
2019-01-30 04:27:47,433 - /usr/local/lib/python3.5/site-packages/kafka/coordinator/base.py[line:434] - INFO: (Re-)joining group google-book
2019-01-30 04:27:47,437 - /usr/local/lib/python3.5/site-packages/kafka/coordinator/base.py[line:504] - INFO: Elected group leader -- performing partition assignments using range
2019-01-30 04:27:47,439 - /usr/local/lib/python3.5/site-packages/kafka/coordinator/base.py[line:333] - INFO: Successfully joined group google-book with generation 470
2019-01-30 04:27:47,439 - /usr/local/lib/python3.5/site-packages/kafka/consumer/subscription_state.py[line:257] - INFO: Updated partition assignment: [TopicPartition(topic='dolphin-spider-google-book-bookinfo', partition=0)]
2019-01-30 04:27:47,439 - /usr/local/lib/python3.5/site-packages/kafka/coordinator/consumer.py[line:238] - INFO: Setting newly assigned partitions {TopicPartition(topic='dolphin-spider-google-book-bookinfo', partition=0)} for group google-book
2019-01-30 04:27:47,694 - /home/dolphin/source/dolphin-pipline/dolphin/biz/spider_bookinfo_consumer.py[line:63] - ERROR: commit offset failed,detail: CommitFailedError: Commit cannot be completed since the group has already
rebalanced and assigned the partitions to another member.
This means that the time between subsequent calls to poll()
was longer than the configured max_poll_interval_ms, which
typically implies that the poll loop is spending too much
time message processing. You can address this either by
increasing the rebalance timeout with max_poll_interval_ms,
or by reducing the maximum size of batches returned in poll()
with max_poll_records.
2019-01-30 04:27:47,694 - /home/dolphin/source/dolphin-pipline/dolphin/biz/spider_bookinfo_consumer.py[line:63] - ERROR: commit offset failed,detail: CommitFailedError: Commit cannot be completed since the group has already
rebalanced and assigned the partitions to another member.
This means that the time between subsequent calls to poll()
was longer than the configured max_poll_interval_ms, which
typically implies that the poll loop is spending too much
time message processing. You can address this either by
increasing the rebalance timeout with max_poll_interval_ms,
or by reducing the maximum size of batches returned in poll()
with max_poll_records.
How can I avoid this problem? What should I do?

Let's first look at what might be the cause of this error. As discussed in the official Kafka consumer documentation (here), Kafka detects connected consumers when they call poll():
After subscribing to a set of topics, the consumer will automatically join the group when poll(Duration) is invoked. The poll API is designed to ensure consumer liveness. As long as you continue to call poll, the consumer will stay in the group and continue to receive messages from the partitions it was assigned. Underneath the covers, the consumer sends periodic heartbeats to the server. If the consumer crashes or is unable to send heartbeats for a duration of session.timeout.ms, then the consumer will be considered dead and its partitions will be reassigned.
So to stay in the group, you must continue to call poll. The max.poll.interval.ms setting declares how long a consumer can stay in a group without calling poll().
Each call to poll() returns many records (500 by default), which are then iterated through in the for message in consumer loop. The next call to poll() is only made once you have finished processing all of the returned records.
If it takes your program too long to process the records, max.poll.interval.ms elapses and you get kicked out of the group. In your case, each poll can return up to 500 records and your consuming thread sleeps 20 seconds whenever all worker threads are busy, so a single batch can easily take longer than the 350 seconds you configured.
Here is what you can do:
Increase max.poll.interval.ms
Decrease max.poll.records
In #Dolphin's answer, he is actually decreasing max.poll.records to 1. I prefer doing it this way:
self.consumer = kafka.KafkaConsumer(topic, bootstrap_servers='servers:ports',
                                    group_id='group_id', max_poll_records=1,
                                    max_poll_interval_ms=300000)
The important part is max_poll_records=1. Of course, you might want to set it to a larger value than 1.

Tweak to consumer poll function:
def consume_bookinfo(self):
    while True:
        try:
            msg_pack = self.consumer.poll(timeout_ms=5000, max_records=1)
            # poll() returns a dict of {TopicPartition: [ConsumerRecord, ...]}
            for topic_partition, consumer_records in msg_pack.items():
                logger.info("TopicPartition: %s", topic_partition)
                for consumer_record in consumer_records:
                    logger.info("Get books info offset: %s", consumer_record.offset)
                    self.sub_process_handle(consumer_record.value, consumer_record.offset)
        except Exception as e:
            logger.error(e)
This works fine for me!

Related

Python SimPy - Request resource for certain time, if not given then quit

So basically: request a resource and, while it's waiting, if no resource is given to the request within X seconds, we don't do anything ("customer did not get help", etc.).
From their example code:
print('%s arrives at the carwash at %.2f.' % (name, env.now))
with cw.machine.request() as request:
    yield request
So, when it requests cw.machine.request(), it should only do that for a certain duration before removing the request and, for example, not doing anything.
print('%s arrives at the carwash at %.2f.' % (name, env.now))
with cw.machine.request() as request:
    # waiting for X seconds before exiting... is it a continuous loop or how does it work?
    yield request
I think your example is a part of simpy's examples. To implement conditions between processes (events), you can use:
& and | operators ---> example: yield first_process & second_process
and
AnyOf() and AllOf() provided in the simpy.events namespace ---> example: yield AnyOf(env, list_of_events)
more about waiting for multiple events at once: READ THIS
Furthermore, you can interrupt another process, catch the interrupt exception, and do whatever you desire when a process is interrupted. READ THIS
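For instance, here is a minimal sketch of the AnyOf form mentioned above (waiter and machine are only illustrative names, not part of the original example):
import simpy
from simpy.events import AnyOf

def waiter(env, machine):
    with machine.request() as req:
        give_up = env.timeout(45)
        # Fires as soon as either the request or the timeout triggers
        yield AnyOf(env, [req, give_up])
        if req.triggered:
            print(f"got the machine at {env.now}")
        else:
            print(f"gave up waiting at {env.now}")

env = simpy.Environment()
machine = simpy.Resource(env, capacity=1)
env.process(waiter(env, machine))
env.run()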
I tried to complete the resource request the way you asked:
with self.machine.request() as request:
    print(f"{car.name} requested machine at {self.env.now}")
    # waiting_threshold of 45 time units
    waiting_threshold = self.env.timeout(45)
    # yield request or waiting_threshold, whichever happens earlier
    yield request | waiting_threshold
    # if we didn't get the resource before waiting_threshold triggered:
    if not request.triggered:
        # Give up on the washing process
        print(f"{car.name} has waited enough. Leaving carwash at {self.env.now}.")
    else:
        print(f"{car.name} started to be washed at {self.env.now}")
        # washing time 30
        yield self.env.timeout(30)
        print(f"{car.name} finished to be washed at {self.env.now}")

Kafka consumer receives message if set group_id to None, but it doesn't receive any message if not None?

I have the following Kafka consumer. It works well if I set group_id to None: it receives all historical messages and my newly tested message.
consumer = KafkaConsumer(
    topic,
    bootstrap_servers=bootstrap_servers,
    auto_offset_reset=auto_offset_reset,
    enable_auto_commit=enable_auto_commit,
    group_id=group_id,
    value_deserializer=lambda x: json.loads(x.decode('utf-8'))
)
for m in consumer:
However, it doesn't receive anything if I set the group_id to some value. I tried to run the test producer to send new messages and nothing is received.
The consumer console does show the following message:
2020-11-07 00:56:01 INFO ThreadPoolExecutor-0_0 base.py (Re-)joining group my_group
2020-11-07 00:56:07 INFO ThreadPoolExecutor-0_0 base.py Successfully joined group my_group with generation 497
2020-11-07 00:56:07 INFO ThreadPoolExecutor-0_0 subscription_state.py Updated partition assignment: []
2020-11-07 00:56:07 INFO ThreadPoolExecutor-0_0 consumer.py Setting newly assigned partitions set() for group my_group
One partition of a topic can only be consumed by one consumer within the same ConsumerGroup.
If you do not set the group.id, the KafkaConsumer will generate a new, random group.id for you. As this group.id is unique, you will see data being consumed.
If you have multiple consumers running with the identical group.id, only one consumer will read the data while the other ones stay idle, not consuming anything.
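A quick way to check which of these situations you are in (a sketch, assuming the consumer constructed above) is to force the group join and print the partitions this instance was actually assigned:
consumer.poll(timeout_ms=1000)  # join the group / trigger a rebalance if needed
print(consumer.assignment())    # set of assigned TopicPartition; empty when other group members own all partitions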
I know this is not the solution to the author's problem. Still, if you landed here, you might be having this problem for another reason, same as I did.
So, at least for kafka-python v2.0.2 and an Aiven Kafka broker setup, the problem was solved by adding a dry call of consumer.poll().
This is especially weird since this is not required when no group_id is assigned.
Output from:
def get():
    for message in consumer:
        print(message.value)
    consumer.commit()
is nothing in this case.
While the version below works as expected, reading out only new messages since the last commit():
def get():
    consumer.poll()
    for message in consumer:
        print(message.value)
    consumer.commit()
This outputs all messages in the topic since the last commit, as expected.
JFYI, class constructor looks like this:
consumer = KafkaConsumer(
    topics,
    bootstrap_servers=self._service_uri,
    auto_offset_reset='earliest',
    enable_auto_commit=False,
    client_id='my_consumer_name',
    group_id=self.GROUP_ID,
    security_protocol="SSL",
    ssl_cafile=self._ca_path,
    ssl_certfile=self._cert_path,
    ssl_keyfile=self._key_path,
)
¯\_(ツ)_/¯

Google Cloud PubSub: Not sending/receiving all messages from Cloud Functions

Summary: My client code triggers 861 background Google Cloud Functions by publishing messages to a Pub/Sub topic. Each Cloud Function performs a task, uploads results to Google Storage, and publishes a message to another Pub/Sub topic where the client code is listening. The client code does not receive all the messages, although all Cloud Functions executed (verified by the number of results in Google Storage).
Server side: I have a background Google Cloud Function which is triggered each time a message is published to a TRIGGER Pub/Sub topic. The custom attributes of the message data act as function parameters, depending on which the function performs a certain task. It then uploads the result to a bucket in Google Storage and publishes a message (with taskID and execution timing details) to the RESULTS Pub/Sub topic (different from the one used to trigger this function).
Client side: I need to perform 861 different tasks, which requires calling the Cloud Function with 861 slightly different inputs. These tasks are similar, and it takes between 20 seconds and 2 minutes (median is about 1 minute) for the Cloud Function to execute them. I have created a Python script for this that I run from the Google Cloud Shell (or a local machine shell). The client Python script publishes 861 messages to the TRIGGER Pub/Sub topic, which triggers as many Cloud Functions concurrently, each of which is passed a unique taskID in the range [0, 860]. The client Python script then polls the RESULTS Pub/Sub topic in a "synchronous pull" way for any messages. The Cloud Function, after performing the task, publishes a message to the RESULTS Pub/Sub topic with the unique taskID and timing details. This unique taskID is used by the client to identify which task the message is from. It also helps in identifying duplicate messages, which are discarded.
Basic steps:
Client python script publishes 861 messages (each with unique taskID) to TRIGGER Pub/Sub topic and waits for result messages from the Cloud Function.
861 different Cloud Functions are called, each of which performs a task, uploads results to Google Storage, and publishes a message (with taskID and execution timing details) to the RESULTS Pub/Sub topic.
The client grabs all the messages synchronously and marks the task as complete.
Problem:
When the client is polling for messages from the RESULTS Pub/Sub topic, I do not receive messages for all the taskIDs. I am sure that the Cloud Functions were called and executed properly (I have 861 results in the Google Storage bucket). I repeated this a number of times and it occurred every time. Strangely, the number of missing taskIDs changes every time, and different taskIDs go missing across different runs. I am also keeping track of the number of duplicate taskIDs received. The number of unique taskIDs received, missing, and repeated is given in the table for 5 independent runs.
SN   # of Tasks   Received   Missing   Repeated
1    861          860        1         25
2    861          840        21        3
3    861          851        10        1
4    861          837        24        3
5    861          856        5         1
I am not sure where this problem might be arising from. Given the random nature of the number as well as the taskIDs that go missing, I suspect there is some bug in the Pub/Sub at-least-once delivery logic. If, in the Cloud Function, I sleep for a few seconds instead of performing the task, for example with time.sleep(5), then everything works just fine (I receive all 861 taskIDs at the client).
Code to reproduce this problem.
In the following, main.py along with requirements.txt are deployed as a Google Cloud Function, while client.py is the client code. Run the client with 100 concurrent tasks as python client.py 100, which repeats the experiment 5 times. A different number of taskIDs goes missing each time.
requirements.txt
google-cloud-pubsub
main.py
"""
This file is deployed as Google Cloud Function. This function starts,
sleeps for some seconds and pulishes back the taskID.
Deloyment:
gcloud functions deploy gcf_run --runtime python37 --trigger-topic <TRIGGER_TOPIC> --memory=128MB --timeout=300s
"""
import time
from random import randint
from google.cloud import pubsub_v1
# Global variables
project_id = "<Your Google Cloud Project ID>" # Your Google Cloud Project ID
topic_name = "<RESULTS_TOPIC>" # Your Pub/Sub topic name
def gcf_run(data, context):
"""Background Cloud Function to be triggered by Pub/Sub.
Args:
data (dict): The dictionary with data specific to this type of event.
context (google.cloud.functions.Context): The Cloud Functions event
metadata.
"""
# Message should contain taskID (in addition to the data)
if 'attributes' in data:
attributes = data['attributes']
if 'taskID' in attributes:
taskID = attributes['taskID']
else:
print('taskID missing!')
return
else:
print('attributes missing!')
return
# Sleep for a random time beteen 30 seconds to 1.5 minutes
print("Start execution for {}".format(taskID))
sleep_time = randint(30, 90) # sleep for this many seconds
time.sleep(sleep_time) # sleep for few seconds
# Marks this task complete by publishing a message to Pub/Sub.
data = u'Message number {}'.format(taskID)
data = data.encode('utf-8') # Data must be a bytestring
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_name)
publisher.publish(topic_path, data=data, taskID=taskID)
return
client.py
"""
The client code creates the given number of tasks and publishes to Pub/Sub,
which in turn calls the Google Cloud Functions concurrently.
Run:
python client.py 100
"""
from __future__ import print_function
import sys
import time
from google.cloud import pubsub_v1
# Global variables
project_id = "<Google Cloud Project ID>" # Google Cloud Project ID
topic_name = "<TRIGGER_TOPIC>" # Pub/Sub topic name to publish
subscription_name = "<subscriber to RESULTS_TOPIC>" # Pub/Sub subscription name
num_experiments = 5 # number of times to repeat the experiment
time_between_exp = 120.0 # number of seconds between experiments
# Initialize the Publisher (to send commands that invoke Cloud Functions)
# as well as Subscriber (to receive results written by the Cloud Functions)
# Configure the batch to publish as soon as there is one kilobyte
# of data or one second has passed.
batch_settings = pubsub_v1.types.BatchSettings(
max_bytes=1024, # One kilobyte
max_latency=1, # One second
)
publisher = pubsub_v1.PublisherClient(batch_settings)
topic_path = publisher.topic_path(project_id, topic_name)
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(
project_id, subscription_name)
class Task:
    """
    A task which will execute the Cloud Function once.

    Attributes:
        taskID (int) : A unique number given to a task (starting from 0).
        complete (boolean) : Flag to indicate if this task has completed.
    """
    def __init__(self, taskID):
        self.taskID = taskID
        self.complete = False

    def start(self):
        """
        Start the execution of Cloud Function by publishing a message with
        taskID to the Pub/Sub topic.
        """
        data = u'Message number {}'.format(self.taskID)
        data = data.encode('utf-8')  # Data must be a bytestring
        publisher.publish(topic_path, data=data, taskID=str(self.taskID))

    def end(self):
        """
        Mark the end of this task.

        Returns (boolean):
            True if normal, False if task was already marked before.
        """
        # If this task was not complete, mark it as completed
        if not self.complete:
            self.complete = True
            return True
        return False
# [END of Task Class]
def createTasks(num_tasks):
    """
    Create a list of tasks and return it.

    Args:
        num_tasks (int) : Number of tasks (Cloud Function calls)
    Returns (list):
        A list of tasks.
    """
    all_tasks = list()
    for taskID in range(0, num_tasks):
        all_tasks.append(Task(taskID=taskID))
    return all_tasks
def receiveResults(all_tasks):
    """
    Receives messages from the Pub/Sub subscription. I am using a blocking
    Synchronous Pull instead of the usual asynchronous pull with a callback
    function as I rely on a polling pattern to retrieve messages.
    See: https://cloud.google.com/pubsub/docs/pull

    Args:
        all_tasks (list) : List of all tasks.
    """
    num_tasks = len(all_tasks)
    total_msg_received = 0  # track the number of messages received
    NUM_MESSAGES = 10  # maximum number of messages to pull synchronously
    TIMEOUT = 600.0  # number of seconds to wait for response (10 minutes)

    # Keep track of elapsed time and exit if > TIMEOUT
    __MyFuncStartTime = time.time()
    __MyFuncElapsedTime = 0.0

    print('Listening for messages on {}'.format(subscription_path))
    while (total_msg_received < num_tasks) and (__MyFuncElapsedTime < TIMEOUT):
        # The subscriber pulls a specific number of messages.
        response = subscriber.pull(subscription_path,
                                   max_messages=NUM_MESSAGES, timeout=TIMEOUT, retry=None)
        ack_ids = []
        # Keep track of all received messages
        for received_message in response.received_messages:
            if received_message.message.attributes:
                attributes = received_message.message.attributes
                taskID = int(attributes['taskID'])
                if all_tasks[taskID].end():
                    # increment count only if task completes the first time
                    # if False, we received a duplicate message
                    total_msg_received += 1
                    # print("Received taskID = {} ({} of {})".format(
                    #     taskID, total_msg_received, num_tasks))
                # else:
                #     print('REPEATED: taskID {} was already marked'.format(taskID))
            else:
                print('attributes missing!')
            ack_ids.append(received_message.ack_id)

        # Acknowledges the received messages so they will not be sent again.
        if ack_ids:
            subscriber.acknowledge(subscription_path, ack_ids)

        time.sleep(0.2)  # Wait 200 ms before polling again
        __MyFuncElapsedTime = time.time() - __MyFuncStartTime
        # print("{} s elapsed. Listening again.".format(__MyFuncElapsedTime))

    # if total_msg_received != num_tasks, the function exited due to timeout
    if total_msg_received != num_tasks:
        print("WARNING: *** Receiver timed out! ***")
    print("Received {} messages out of {}. Done.".format(
        total_msg_received, num_tasks))
def main(num_tasks):
    """
    Main execution point of the program
    """
    for experiment_num in range(1, num_experiments + 1):
        print("Starting experiment {} of {} with {} tasks".format(
            experiment_num, num_experiments, num_tasks))
        # Create all tasks and start them
        all_tasks = createTasks(num_tasks)
        for task in all_tasks:  # Start all tasks
            task.start()
        print("Published {} taskIDs".format(num_tasks))

        receiveResults(all_tasks)  # Receive message from Pub/Sub subscription

        print("Waiting {} seconds\n\n".format(time_between_exp))
        time.sleep(time_between_exp)  # sleep between experiments

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("usage: python client.py <num_tasks>")
        print("  num_tasks: Number of concurrent Cloud Function calls")
        sys.exit()
    num_tasks = int(sys.argv[1])
    main(num_tasks)
In your cloud function, in this line:
publisher.publish(topic_path, data=data, taskID=taskID)
You are not waiting for the future that publisher.publish returns. This means you cannot be guaranteed that the publish onto the topic has actually happened when you fall off the end of the gcf_run function, but the message on the TRIGGER topic's Cloud Functions subscription is ACKed anyway.
Instead, to wait until the publish occurs for the cloud function to terminate, this should be:
publisher.publish(topic_path, data=data, taskID=taskID).result()
You should also avoid bringing up and tearing down the publisher client on each function call, instead having the client as a global variable.
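Putting both suggestions together, a sketch of the adjusted Cloud Function could look like this (same placeholder names as in main.py above; an illustration, not a drop-in replacement):
from google.cloud import pubsub_v1

project_id = "<Your Google Cloud Project ID>"
topic_name = "<RESULTS_TOPIC>"

# Created once at cold start and reused across invocations of this instance
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_name)

def gcf_run(data, context):
    taskID = data.get('attributes', {}).get('taskID')
    if taskID is None:
        print('attributes or taskID missing!')
        return
    payload = u'Message number {}'.format(taskID).encode('utf-8')
    # .result() blocks until the publish has actually completed, so the
    # function does not return (and get its trigger message ACKed) too early
    publisher.publish(topic_path, data=payload, taskID=taskID).result()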

Celery retried messages remain in same worker

I had in the back of my mind the impression that if a Celery worker gets a task and it is retried, it remains in the worker's memory (with the ETA) and doesn't return to the queue.
The result being that if a Celery task is retried and the worker is busy working on different tasks when that task's ETA arrives, it has to wait until the worker finishes processing the other tasks.
I tried looking in the documentation for something that aligns with what I remembered, but I can't find anything.
What I did to try to check it was create two tasks:
@app.task(bind=True, name='task_that_holds_worker', rate_limit='4/m',
          default_retry_delay=5 * 60,
          max_retries=int(60 * 60 * 24 * 1 / (60 * 5)))
def task_that_holds_worker(self, *args, **kwargs):
    import time
    time.sleep(50000)

@app.task(bind=True, name='retried_task', rate_limit='2/m',
          default_retry_delay=10 * 60,
          max_retries=int(60 * 60 * 24 * 1 / (60 * 10)))
def retried_task(self, *args, **kwargs):
    self.retry()
These are the simplest tasks, just to check that if a worker is busy with another task, the retried task is not processed by another worker.
I then launched one worker - and triggered those two tasks in the following way:
from some_app import tasks
from some_app.celery_app import app
current_app = app.tasks
async_result = tasks.retried_task.delay()
import time
time.sleep(20)
async_result = tasks.task_that_holds_worker.delay()
The worker processed the retried task and retried it,
and then moved on to the task that sleeps.
I then launched another worker and I can see that it is not getting the 'retried' task; only the first worker has it.
Each worker was launched with --prefetch-multiplier=1 --concurrency=1.
Is there something wrong with the way I reproduced this?
or is this the way a celery retried task behaves?
Thanks in advance!
celery: 4.1.2
Python: 3.6.2
Rabbitmq Image: rabbitmq:3.6.9-management
Seems like this is an issue with tasks that have an ETA: the first available worker counts down until the task's ETA and doesn't release it back to the queue (the prefetch count is increased and ignored).
https://github.com/celery/celery/issues/2541
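One way to observe this on a running setup (a sketch, reusing the app object from some_app.celery_app in the question; the inspect calls are Celery's standard remote-control API) is to ask the workers what they are currently holding:
from some_app.celery_app import app

insp = app.control.inspect()
print(insp.scheduled())  # ETA/countdown tasks a worker has pulled and is holding locally
print(insp.reserved())   # tasks prefetched by a worker but not yet started
print(insp.active())     # tasks currently executing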
There is an error with how you reproduced it. Unless you have a special broker, Celery will always requeue a task retry request back to the broker. Workers do not retain any memory of which task they attempted, and there is no data added to the retry request that allows Celery to route the task request back to the same worker. There is no guarantee or assurance that the same worker will retry a task that it has seen before. You can confirm this in the Celery code in celery/app/task.py:
# get the signature of the task as called
S = self.signature_from_request(
    request, args, kwargs,
    countdown=countdown, eta=eta, retries=retries,
    **options
)

if max_retries is not None and retries > max_retries:
    if exc:
        # On Py3: will augment any current exception with
        # the exc' argument provided (raise exc from orig)
        raise_with_context(exc)
    raise self.MaxRetriesExceededError(
        "Can't retry {0}[{1}] args:{2} kwargs:{3}".format(
            self.name, request.id, S.args, S.kwargs))

ret = Retry(exc=exc, when=eta or countdown)

if is_eager:
    # if task was executed eagerly using apply(),
    # then the retry must also be executed eagerly.
    S.apply().get()
    if throw:
        raise ret
    return ret

try:
    S.apply_async()
except Exception as exc:
    raise Reject(exc, requeue=False)
if throw:
    raise ret
return ret
The key part above shows how the retry works: Celery gets the task's request signature (this includes the task name and the arguments to the task, and sets the eta, countdown, and retries). And then Celery simply calls apply_async, which under the hood just queues up a new task request to the broker.
Your sample did not work because celery workers will often pull more than one task request off of the broker, so what likely happened is that the first worker grabbed the task off of the broker before the second worker had come online.

How to pause and resume consumption gracefully in rabbitmq, pika python

I'm using basic_consume() for receiving messages and basic_cancel for canceling consuming, but there is a problem.
Here is the code of pika.channel
def basic_consume(self, consumer_callback, queue='', no_ack=False,
                  exclusive=False, consumer_tag=None):
    """Sends the AMQP command Basic.Consume to the broker and binds messages
    for the consumer_tag to the consumer callback. If you do not pass in
    a consumer_tag, one will be automatically generated for you. Returns
    the consumer tag.

    For more information on basic_consume, see:
    http://www.rabbitmq.com/amqp-0-9-1-reference.html#basic.consume

    :param method consumer_callback: The method to callback when consuming
    :param queue: The queue to consume from
    :type queue: str or unicode
    :param bool no_ack: Tell the broker to not expect a response
    :param bool exclusive: Don't allow other consumers on the queue
    :param consumer_tag: Specify your own consumer tag
    :type consumer_tag: str or unicode
    :rtype: str

    """
    self._validate_channel_and_callback(consumer_callback)

    # If a consumer tag was not passed, create one
    consumer_tag = consumer_tag or 'ctag%i.%s' % (self.channel_number,
                                                  uuid.uuid4().get_hex())

    if consumer_tag in self._consumers or consumer_tag in self._cancelled:
        raise exceptions.DuplicateConsumerTag(consumer_tag)

    self._consumers[consumer_tag] = consumer_callback
    self._pending[consumer_tag] = list()
    self._rpc(spec.Basic.Consume(queue=queue,
                                 consumer_tag=consumer_tag,
                                 no_ack=no_ack,
                                 exclusive=exclusive),
              self._on_eventok,
              [(spec.Basic.ConsumeOk,
                {'consumer_tag': consumer_tag})])

    return consumer_tag

def basic_cancel(self, callback=None, consumer_tag='', nowait=False):
    """This method cancels a consumer. This does not affect already
    delivered messages, but it does mean the server will not send any more
    messages for that consumer. The client may receive an arbitrary number
    of messages in between sending the cancel method and receiving the
    cancel-ok reply. It may also be sent from the server to the client in
    the event of the consumer being unexpectedly cancelled (i.e. cancelled
    for any reason other than the server receiving the corresponding
    basic.cancel from the client). This allows clients to be notified of
    the loss of consumers due to events such as queue deletion.

    :param method callback: Method to call for a Basic.CancelOk response
    :param str consumer_tag: Identifier for the consumer
    :param bool nowait: Do not expect a Basic.CancelOk response
    :raises: ValueError

    """
    self._validate_channel_and_callback(callback)
    if consumer_tag not in self.consumer_tags:
        return
    if callback:
        if nowait is True:
            raise ValueError('Can not pass a callback if nowait is True')
        self.callbacks.add(self.channel_number,
                           spec.Basic.CancelOk,
                           callback)
    self._cancelled.append(consumer_tag)
    self._rpc(spec.Basic.Cancel(consumer_tag=consumer_tag,
                                nowait=nowait),
              self._on_cancelok,
              [(spec.Basic.CancelOk,
                {'consumer_tag': consumer_tag})] if nowait is False else [])
As you can see, every time I cancel consumption, the consumer_tag is added to the _cancelled list, and if I use this tag in basic_consume again, a DuplicateConsumerTag exception is raised.
Well, I could use a new consumer_tag every time, but in fact I'm not, because sooner or later a generated tag would exactly match one of the previous ones.
How should I pause and resume consumption gracefully in pika?
Is there a reason why you define your own consumer_tags? You can pass an empty string and let RabbitMQ generate consumer tags for you. The reply from basic.consume, which is basic.consume-ok, will return the generated consumer_tag, so you can use it later to stop consuming.
See: http://www.rabbitmq.com/amqp-0-9-1-reference.html#basic.consume-ok
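A minimal pause/resume sketch following that suggestion (using the basic_consume/basic_cancel signatures quoted above; channel, on_message and my_queue are assumed to already exist):
# Let the consumer tag be generated: the return value is the tag to keep.
consumer_tag = channel.basic_consume(on_message, queue='my_queue')

# Pause: stop deliveries for this consumer, keep the channel open.
channel.basic_cancel(consumer_tag=consumer_tag)

# Resume: consume again and store the newly generated tag.
consumer_tag = channel.basic_consume(on_message, queue='my_queue')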
That looks like Pika is doing more than it should - it doesn't need to create a consumer tag if one is not supplied (the server will) and it also doesn't need to watch for duplicated consumer tags (resuming with the same tag is supported by the server).
So I'm not sure how to do this with Pika - file a bug I suppose.
