Google Cloud PubSub: Not sending/receiving all messages from Cloud Functions

Google Cloud PubSub: Not sending/receiving all messages from Cloud Functions - python

Summary: My client code triggers 861 background Google Cloud Function by publishing messages to a Pub/Sub topic. Each Cloud Function performs a task, uploads results to Google Storage, and publishing messages to another Pub/Sub topic where the client code is listening. Client code does not receive all the messages although all Cloud Functions executed (verified by the number of results in Google Storage).
Server side: I have a background Google Cloud Function which is triggered each time a message is published to a TRIGGER Pub/Sub topic. The custom attributes of the message data act as function parameter depending upon which the function performs a certain task. It then upload the result to a bucket in Google Storage and publishes a message (with taskID and execution timing details) to RESULTS Pub/Sub topic (different than the one used to trigger this function).
Client side: I need to perform 861 different tasks which requires calling the Cloud Function with 861 slightly different inputs. These tasks are similar and it takes between 20 seconds to 2 minutes (median is about 1 minute) for the Cloud Function to execute them. I have created a python script for this that I run from the Google Cloud Shell (or a local machine shell). The client python script publishes 861 messages to the TRIGGER Pub/Sub topic that triggers as many Cloud Functions concurrently, each of which is passed a unique taskID in the rage [0, 860]. The client python script then polls the RESULTS Pub/Sub topic in a "synchronous pull" way for any messages. The Cloud Function, after performing the task publishes message to RESULTS Pub/Sub topic with the unique taskID and timing details. This unique taskID is used by the client to identify from which task the message is from. It also helps in identifying duplicate messages which are discarded.
Basic steps:
Client python script publishes 861 messages (each with unique taskID) to TRIGGER Pub/Sub topic and waits for result messages from the Cloud Function.
861 different Cloud Functions are called, each of which performs a task, uploads results to Google Storage, and publishes message (with taskID and execution timing details) to RESULTS Pub/Sub topic.
The client grabs all the messages synchronously and marks the task as complete.
Problem:
When the client is polling for the messages from RESULTS Pub/Sub topic, I did not receive messages for all the taskID. I am sure that the Cloud Function got called and executed properly (I have 861 results in Google Storage bucket). I repeated this for a number of times and it occurred every time. Strangely, the number of missing taskID change every time as well as different taskID go missing across different runs. I am also keeping a track of number of duplicate taskID received. The number of unique taskID received, missing, and repeated are given in the table for 5 independent runs.
SN # of Tasks Received Missing Repeated
1 861 860 1 25
2 861 840 21 3
3 861 851 10 1
4 861 837 24 3
5 861 856 5 1
I am not sure where this problem might be arising from. Given the random nature of the number as well as taskIDs that go missing, I suspect there is some bug in the Pub/Sub at-least-once delivery logic. If in the Cloud Function, I sleep for a few seconds instead of performing the task, for example with time.sleep(5), then everything works just fine (I receive all 861 taskID at the client).
Code to reproduce this problem.
In the following, main.py along with requirements.txt are deployed as Google Cloud Function while client.py is the client code. Run the client with 100 concurrent tasks as python client.py 100 which repeats it 5 times. Different number of taskID go missing each time.
requirements.txt
google-cloud-pubsub
main.py
"""
This file is deployed as Google Cloud Function. This function starts,
sleeps for some seconds and pulishes back the taskID.
Deloyment:
gcloud functions deploy gcf_run --runtime python37 --trigger-topic <TRIGGER_TOPIC> --memory=128MB --timeout=300s
"""
import time
from random import randint
from google.cloud import pubsub_v1
# Global variables
project_id = "<Your Google Cloud Project ID>" # Your Google Cloud Project ID
topic_name = "<RESULTS_TOPIC>" # Your Pub/Sub topic name
def gcf_run(data, context):
"""Background Cloud Function to be triggered by Pub/Sub.
Args:
data (dict): The dictionary with data specific to this type of event.
context (google.cloud.functions.Context): The Cloud Functions event
metadata.
"""
# Message should contain taskID (in addition to the data)
if 'attributes' in data:
attributes = data['attributes']
if 'taskID' in attributes:
taskID = attributes['taskID']
else:
print('taskID missing!')
return
else:
print('attributes missing!')
return
# Sleep for a random time beteen 30 seconds to 1.5 minutes
print("Start execution for {}".format(taskID))
sleep_time = randint(30, 90) # sleep for this many seconds
time.sleep(sleep_time) # sleep for few seconds
# Marks this task complete by publishing a message to Pub/Sub.
data = u'Message number {}'.format(taskID)
data = data.encode('utf-8') # Data must be a bytestring
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(project_id, topic_name)
publisher.publish(topic_path, data=data, taskID=taskID)
return
client.py
"""
The client code creates the given number of tasks and publishes to Pub/Sub,
which in turn calls the Google Cloud Functions concurrently.
Run:
python client.py 100
"""
from __future__ import print_function
import sys
import time
from google.cloud import pubsub_v1
# Global variables
project_id = "<Google Cloud Project ID>" # Google Cloud Project ID
topic_name = "<TRIGGER_TOPIC>" # Pub/Sub topic name to publish
subscription_name = "<subscriber to RESULTS_TOPIC>" # Pub/Sub subscription name
num_experiments = 5 # number of times to repeat the experiment
time_between_exp = 120.0 # number of seconds between experiments
# Initialize the Publisher (to send commands that invoke Cloud Functions)
# as well as Subscriber (to receive results written by the Cloud Functions)
# Configure the batch to publish as soon as there is one kilobyte
# of data or one second has passed.
batch_settings = pubsub_v1.types.BatchSettings(
max_bytes=1024, # One kilobyte
max_latency=1, # One second
)
publisher = pubsub_v1.PublisherClient(batch_settings)
topic_path = publisher.topic_path(project_id, topic_name)
subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(
project_id, subscription_name)
class Task:
"""
A task which will execute the Cloud Function once.
Attributes:
taskID (int) : A unique number given to a task (starting from 0).
complete (boolean) : Flag to indicate if this task has completed.
"""
def __init__(self, taskID):
self.taskID = taskID
self.complete = False
def start(self):
"""
Start the execution of Cloud Function by publishing a message with
taskID to the Pub/Sub topic.
"""
data = u'Message number {}'.format(self.taskID)
data = data.encode('utf-8') # Data must be a bytestring
publisher.publish(topic_path, data=data, taskID=str(self.taskID))
def end(self):
"""
Mark the end of this task.
Returns (boolean):
True if normal, False if task was already marked before.
"""
# If this task was not complete, mark it as completed
if not self.complete:
self.complete = True
return True
return False
# [END of Task Class]
def createTasks(num_tasks):
"""
Create a list of tasks and return it.
Args:
num_tasks (int) : Number of tasks (Cloud Function calls)
Returns (list):
A list of tasks.
"""
all_tasks = list()
for taskID in range(0, num_tasks):
all_tasks.append(Task(taskID=taskID))
return all_tasks
def receiveResults(all_tasks):
"""
Receives messages from the Pub/Sub subscription. I am using a blocking
Synchronous Pull instead of the usual asynchronous pull with a callback
funtion as I rely on a polling pattern to retrieve messages.
See: https://cloud.google.com/pubsub/docs/pull
Args:
all_tasks (list) : List of all tasks.
"""
num_tasks = len(all_tasks)
total_msg_received = 0 # track the number of messages received
NUM_MESSAGES = 10 # maximum number of messages to pull synchronously
TIMEOUT = 600.0 # number of seconds to wait for response (10 minutes)
# Keep track of elapsed time and exit if > TIMEOUT
__MyFuncStartTime = time.time()
__MyFuncElapsedTime = 0.0
print('Listening for messages on {}'.format(subscription_path))
while (total_msg_received < num_tasks) and (__MyFuncElapsedTime < TIMEOUT):
# The subscriber pulls a specific number of messages.
response = subscriber.pull(subscription_path,
max_messages=NUM_MESSAGES, timeout=TIMEOUT, retry=None)
ack_ids = []
# Keep track of all received messages
for received_message in response.received_messages:
if received_message.message.attributes:
attributes = received_message.message.attributes
taskID = int(attributes['taskID'])
if all_tasks[taskID].end():
# increment count only if task completes the first time
# if False, we received a duplicate message
total_msg_received += 1
# print("Received taskID = {} ({} of {})".format(
# taskID, total_msg_received, num_tasks))
# else:
# print('REPEATED: taskID {} was already marked'.format(taskID))
else:
print('attributes missing!')
ack_ids.append(received_message.ack_id)
# Acknowledges the received messages so they will not be sent again.
if ack_ids:
subscriber.acknowledge(subscription_path, ack_ids)
time.sleep(0.2) # Wait 200 ms before polling again
__MyFuncElapsedTime = time.time() - __MyFuncStartTime
# print("{} s elapsed. Listening again.".format(__MyFuncElapsedTime))
# if total_msg_received != num_tasks, function exit due to timeout
if total_msg_received != num_tasks:
print("WARNING: *** Receiver timed out! ***")
print("Received {} messages out of {}. Done.".format(
total_msg_received, num_tasks))
def main(num_tasks):
"""
Main execution point of the program
"""
for experiment_num in range(1, num_experiments + 1):
print("Starting experiment {} of {} with {} tasks".format(
experiment_num, num_experiments, num_tasks))
# Create all tasks and start them
all_tasks = createTasks(num_tasks)
for task in all_tasks: # Start all tasks
task.start()
print("Published {} taskIDs".format(num_tasks))
receiveResults(all_tasks) # Receive message from Pub/Sub subscription
print("Waiting {} seconds\n\n".format(time_between_exp))
time.sleep(time_between_exp) # sleep between experiments
if __name__ == "__main__":
if(len(sys.argv) != 2):
print("usage: python client.py <num_tasks>")
print(" num_tasks: Number of concurrent Cloud Function calls")
sys.exit()
num_tasks = int(sys.argv[1])
main(num_tasks)

In your cloud function, in this line:
publisher.publish(topic_path, data=data, taskID=taskID)
You are not waiting for the future that publisher.publish returns. This means you cannot be guaranteed that the publish onto the topic has actually happened when you fall off the end of the gcf_run function, but the message on the TRIGGER topic cloud functions subscription is ACK-ed anyway.
Instead, to wait until the publish occurs for the cloud function to terminate, this should be:
publisher.publish(topic_path, data=data, taskID=taskID).result()
You should also avoid bringing up and tearing down the publisher client on each function call, instead having the client as a global variable.

Related

Websocket buffer delayed Python

I
setting up a Websocket that receives market data from 33 pairs, process the data and insert it into a local mysql database.
what I've tried so far :
Setting up the websocket works fine, then process the data on each new message function and insert it directly into the database
--> problem was that with 33 pairs the websocket was stacking up the buffer with market data, and after a few minutes I would get a delay in the database of at least 10 seconds
Then I tried processing the data through a thread : the on_message function would execute a thread that is simply putting the market data into an array, like below
datas=[]
def add_queue(symbol,t,a,b,r_n):
global datas
datas.append([symbol,t,a,b,r_n])
if json_msg['ev']=="C":
symbol=json_msg['p'].replace("/","-")
round_number=pairs_dict_new[symbol]
t = Thread(target=add_queue, args=(symbol,json_msg['t'],json_msg['a'],json_msg['b'],round_number,))
t.start()
and then another function, with a loop thread would pick it up to insert it into the database
def add_db():
global datas
try:
# db = mysql.connector.connect(
# host="104.168.157.164",
# user="bvnwurux_noe_dev",
# password="Tickprofile333",
# database="bvnwurux_tick_values"
# )
while True:
for x in datas:
database.add_db(x[0],x[1],x[2],x[3],x[4])
if x in datas:
datas.remove(x)
except KeyboardInterrupt:
print("program ending..")
t2 = Thread(target=add_db)
t2.start()
still giving a delay, and the threaded process wasn't actually using a lot of CPU but more of RAM and it just was even worse.
instead of using a websocket with a thread, I tried simple webrequests to the API call, so with 1 thread per symbol, it would loop through a webrequest and in everythread send it to the database. my issues here were that mysql connections don't like threads (sometimes they would make a request with the same connection at the same time and crash) or it would still be delayed by the time to process the code, even without buffer. the code was taking too long to process the answered request that it couldnt keep it under 10s of delay.
Here is a little example of the basic code I used to get the data.
pairs={'AUDCAD':5,'AUDCHF':5,'AUDJPY':3,'AUDNZD':5,'AUDSGD':2,'AUDUSD':5,'CADCHF':5,'CADJPY':3,'CHFJPY':3,'EURAUD':5,'EURCAD':5,'EURCHF':5,'EURGBP':5,'EURJPY':3,'EURNZD':5,'EURSGD':5,'EURUSD':5,'GBPAUD':5,'GBPCAD':5,'GBPCHF':5,'GBPJPY':3,'GBPNZD':5,'GBPSGD':5,'GBPUSD':5,'NZDCAD':5,'NZDCHF':5,'NZDJPY':3,'NZDUSD':5,'USDCAD':5,'USDCHF':5,'USDJPY':3,'USDSGD':5,'SGDJPY':3}
def on_open(ws):
print("Opened connection")
ws.send('{"action":"auth","params":"<API KEY>"}') #connecting with secret api key
def on_message(ws, message):
print("msg",message)
json_msg = json.loads(message)[0]
if json_msg['status'] == "auth_success": # successfully authenticated
r = ws.send('{"action":"subscribe","params":"C.*"}') # subscribing to currencies
print("should subscribe to " + pairs)
#once the websocket is connected to all the pairs, process the data
--> process json_msg
if __name__ == "__main__":
# websocket.enableTrace(True) # just to show all the requests made (debug mode)
ws = websocket.WebSocketApp("wss://socket.polygon.io/forex",
on_open=on_open,
on_message=on_message)
ws.run_forever(dispatcher=rel) # Set dispatcher to automatic reconnection
rel.signal(2, rel.abort) # Keyboard Interrupt
rel.dispatch()
method I tried multiprocess, but this was on the other crashing my server because it would use 100% CPU, and then the requests made on the apache server would not reach or take a long time loading. Its really a balance problem
I'm using an ubuntu server with 32CPUS, based in london and the API polygon is based in NYC.
I also tried with 4 CPUS in seattle to NYC, but still no luck.
Even with 4 pairs and 32CPUS , it would eventually reach 10s delay. I think this is more of a code structure problem.

Kafka | exactly-once consumer consume a message more than once

In our applications have enabled exactly-once in both Producer and Consumer.
Producer is a python component.We have enabled:
idempotence
use transactions (new transactionId is used every time when we send messages)
Consumer is a Spring Boot application. We have enabled:
read_committed isolation level
use manual acknowledgement for messages
We have multi-partition Kafka topic (lets say 3 partitions) on ConfluentCloud.
Our application design is as follows:
multiple Producer app instances
for performance ,we have lots of Consumer app instances (currently around 24)
Problem:
We noticed that sometimes the same Kafka message is consumed more than once in the Consumer.We detected this by using following consumer code. We keep the previously consumed kafka message Id (with offset) in Redis and compare them with newly consumed message.
Consumer code:
#KafkaListener(topics = "${datalake.datasetevents.topic}", groupId = "${spring.kafka.consumer.group-id}")
public void listen(#Header(KafkaHeaders.RECEIVED_MESSAGE_KEY) String key,
#Header(KafkaHeaders.OFFSET) String offset,
#Payload InputEvent inputEvent, Acknowledgment acknowledgment) {
//KafkaHeaders.
Event event = new Event();
event.setCorrId(inputEvent.getCorrId());
event.setQn(inputEvent.getQn());
event.setCreatedTs(new Date());
event.setEventTs(inputEvent.getEventTs());
event.setMeta(inputEvent.getMeta() != null ? inputEvent.getMeta(): new HashMap<>());
event.setType(inputEvent.getType());
event.setUlid(key);
//detect message duplications
try {
String eventRedisKey = "tg_e_d_" + key.toLowerCase();
String redisVal = offset;
String tmp = redisTemplateString.opsForValue().get(eventRedisKey);
if (tmp != null) {
dlkLogging.error("kafka_event_dup", "Event consumed more than once ulid:" + event.getUlid()+ " redis offset: "+tmp+ " event offset:"+offset);
redisTemplateString.delete(eventRedisKey);
}
redisTemplateString.opsForValue().set(eventRedisKey, redisVal, 30, TimeUnit.SECONDS);
} catch (Exception e) {
dlkLogging.error("kafka_consumer_redis","Redis error at kafka consumere ", e);
}
//process the message and ack
try {
eventService.saveEvent(persistEvent, event);
ack.acknowledge();
} catch (Exception ee) {
//Refer : https://stackoverflow.com/questions/62413270/kafka-what-is-the-point-of-using-acknowledgment-nack-if-i-can-simply-not-ack
ack.nack(1);
dlkLogging.error("event_sink_error","error sinking kafka event.Will retry", ee);
}
}
Behavior:
We notice the "kafka_event_dup" is sent several times per day.
Error Message: Event consumed more than once
ulid:01G77G8KNTSM2Q01SB1MK60BTH redis offset: 659238 event
offset:659238
Question:
Why consumer read the same message even though we have configured exactly-once in both Producer and Consumer?
Update : After reading several SO posts, seems we still need to implement deduplication logic in Consumer side even though exactly-once is configured?
Additional Info:
Consumer configuration:
public DefaultKafkaConsumerFactory kafkaDatasetEventConsumerFactory(KafkaProperties properties) {
Map<String, Object> props = properties.buildConsumerProperties();
props.put(ENABLE_AUTO_COMMIT_CONFIG, false);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ErrorHandlingDeserializer.class);
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ErrorHandlingDeserializer.class);
props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ErrorHandlingDeserializer.class);
props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ErrorHandlingDeserializer.class);
props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
props.put(ErrorHandlingDeserializer.KEY_DESERIALIZER_CLASS, StringDeserializer.class);
props.put(ErrorHandlingDeserializer.VALUE_DESERIALIZER_CLASS, CustomJsonDeserializer.class.getName());
props.put(JsonDeserializer.VALUE_DEFAULT_TYPE, "com.fr.det.datalake.eventdriven.model.kafka.InputEvent");
return new DefaultKafkaConsumerFactory(props);
}
Producer code (python):
def __get_producer(self):
conf = {
'bootstrap.servers': self.server,
'enable.idempotence': True,
'acks': 'all',
'retry.backoff.ms': self.sleep_seconds * 100
}
if self.sasl_mechanism:
conf['sasl.mechanisms'] = self.sasl_mechanism
if self.security_protocol:
conf['security.protocol'] = self.security_protocol
if self.sasl_username:
conf['sasl.username'] = self.sasl_username
if self.sasl_username:
conf['sasl.password'] = self.sasl_password
if self.transaction_prefix:
conf['transactional.id'] = self.__get_transaction_id()
producer = Producer(conf)
return producer
#_retry_on_error
def send_messages(self, messages, *args, **kwargs):
ts = time.time()
producer = kwargs.get('producer', None)
if producer is not None:
for message in messages:
key = message.get('key', str(ulid.from_timestamp(ts)))
value = message.get('value', None)
topic = message.get('topic', self.topic)
producer.produce(topic=topic,
value=value,
key=key,
on_delivery=self.acked)
producer.commit_transaction(30)
def _retry_on_error(func, *args, **kwargs):
def inner(self, messages, *args, **kwargs):
attempts = 0
while True:
attempts += 1
sleep_time = attempts * self.sleep_seconds
try:
producer = self.__get_producer()
self.logger.info(f"Producer: {producer}, Attempt: {attempts}")
producer.init_transactions(30)
producer.begin_transaction()
res = func(self, messages, *args, producer=producer, **kwargs)
return res
except KafkaException as e:
if attempts <= self.retry_count:
if e.args[0].txn_requires_abort():
producer.abort_transaction(30)
time.sleep(sleep_time)
continue
self.logger.error(str(e), exc_info=True, extra=extra)
break
return inner

Kafka exactly-once is essentially a Kafka-Streams feature, although it can be used with regular consumer and producers as well.
Exactly once can only be achieved in a context where your applications are only interacting with Kafka: there is no XA nor other kind of distributed transactions across technologies that would enable a Kafka consumer interact with some other storage (like Redis) in an exactly-once manner.
In a distributed world, we have to acknowledge that is not desirable, since it introduce locking, contention, and exponentially degrading performance under load. If we don't need to be in a distributed world, then we don't need Kafka and many things become easier.
Transactions in Kafka are meant to be used within one application that is only interacting with Kafka, it lets you guarantee that the app will 1) read from some topic partitions, 2) write some result in some other topic partitions and 3) commit the read offsets related to 1, or do none of those things. If several apps are put back-to-back and interacting through Kafka in such manner, then you could achieve exactly once if you're very careful. If your consumer needs to 4) interact with Redis 5) interact with some other storage or do some side effect somewhere (like sending an email or so), then there is in general no way to perform steps 1,2,3,4,5 atomically as part of a distributed application. You can achieve this kind of things with other storage technologies (yes, Kafka is essentially a storage), but they cannot be distributed and your application cannot either. That's essentially what the CAP theorem tells us.
That's also why exactly-once is essentially a Kafka-streams stuff: Kafka Stream is just a smart wrapper around the Kafka consumer/producer clients that lest you build applications that interact only with Kafka.
You can also achieve exactly-once streaming processing with other data-processing framework, like Spark Streaming or Flink.
In practice it's often much simpler to not bother with transactions and just de-duplicate in the consumer. You have the guarantee that at max one consumer of the consumer group is connected to each partition at any point in time, so duplicates will always happen in the same instance of your app (until it re-scales), and, depending on your config, the duplication should typically only happen within one single Kafka consumer buffer, so you don't need to store much state in your consumer to de-duplicate. If you use some kind of event-ids that can only increase for example (which is essentially what the Kafka offset is BTW, and it's no coincidence), then you just need to keep in the state of each instance of your app the maximum event-id per partition that you've successfully processed.

I can see you have set ENABLE_AUTO_COMMIT_CONFIG to false, that means you are having a manual commit process in place. If we are not committing the offset of the messages read efficiently, then we will end up in processing duplicate messages.
Kindly refer session from 4.6 from https://www.baeldung.com/kafka-exactly-once
Also, for processing.guarantee : exactly_once the following parameters you no need to set explicitly.
isolation.level=read_committed
enable.idempotence=true
MAX_IN_FLIGHT_REQUESTS_PER_CONNECTION=5

Quickfix Python issue with message sequence numbers between heartbeat and order event

I currently have a script running using the Python version of Quickfix, in order to connect to a counterparty to send and receive FIX protocol messages. I successfully connect, logon and send regular heartbeat messages, with the relevant outgoing and incoming messages incrementing their respective message sequence numbers in the correct manner.
I am looking to create a loop whereby I query a DB to determine whether any new orders have arrived. Once they do appear, an order request is then sent to the counterparty.
The issue arises when the order event is triggered. The message sequence number for my outgoing message ends up being the same as the subsequent heartbest outgoing message (which seems to get triggered at exactly the same time. See below for snippet of issue:
Message type Order:
2021-06-28 15:40:12.018_699 [3370-4e4ea14a:7955d73cf3:189] (INFO) Receiving : 8=FIX.4.4|9=214|35=D|34=189|49=xxxx|52=20210628-14:40:12.016|56=xxxx|11=dzjtufgnmyy|15=EUR|22=4|38=5000|40=1|48=XS2058749925|54=1|55=N/A|60=20210628-14:40:12|115=3423581|423=1|453=1|448=210057308|447=D|452=7|537=1|10=002|
Message type Heartbeat:
2021-06-28 15:40:12.059_849 [3370-4e4ea14a:7955d73d1c:189] (DEBUG) Receiving : 8=FIX.4.4|9=87|35=0|34=189|49=xxxx|52=20210628-14:40:12.003|56=xxxx|112=synchronized?|10=252|
This is how I start the session, I am the acceptor, running through a thread:
class Threading(object):
def __init__(self, interval=1):
self.interval = interval
thread = threading.Thread(target=self.run, args=())
thread.daemon = True
thread.start()
def run(self):
while True:
settings = fix.SessionSettings(fileName)
application = Application()
storeFactory = fix.FileStoreFactory(settings)
logFactory = fix.ScreenLogFactory(settings)
acceptor = fix.SocketAcceptor(application, storeFactory, settings, logFactory)
acceptor.start()
print("Started acceptor")
I need to poll the database on a regular basis, so I chose on each heartbeat (every 30 seconds), to run a function. I delay for 5 seconds so as to try and avoid sending the order at same time as heartbeat message (obviously not working!).
Here is the code for the script to poll the database:
def toAdmin(self, message, sessionID):
print("toAdmin")
if db_MsgType_fieldvalue == '0':
time.sleep(5)
self.checkDB(sessionID)
Finally, here is the code for when the database gets queried and when the message gets sent:
def checkDB(self, sessionID):
print("Start test trade")
fixtradesdf = pd.read_sql_query("Select * from database where ordstatus = '0'",engine)
for index, row in fixtradesdf.iterrows():
message = fix.Message();
header = message.getHeader();
header.setField(fix.BeginString("FIX.4.4"))
header.setField(fix.BodyLength(137))
header.setField(fix.MsgType("D"))
header.setField(fix.SenderCompID("xxxx"))
header.setField(fix.TargetCompID("xxxx"))
header.setField(fix.SendingTime())
message.setField(fix.ClOrdID("xxxx"))
message.setField(fix.Symbol("N/A"))
if row['side'] == 'BUY':
side = fix.Side_BUY
else:
side = fix.Side_SELL
message.setField(fix.Side(side))
message.setField(fix.TransactTime())
message.setField(fix.OrderQty(orderqty))
message.setField(fix.Currency("EUR"))
message.setField(fix.SecurityID("XXXX"))
message.setField(fix.SecurityIDSource("4"))
message.setField(fix.OrdType(fix.OrdType_MARKET))
message.setField(fix.PriceType(1))
message.setField(fix.QuoteType(1))
group = fix.Group(453,448)
group.setField(fix.PartyID("XXXX"))
group.setField(fix.PartyIDSource("D"))
group.setField(fix.PartyRole(7))
message.addGroup(group)
message.setField(fix.OnBehalfOfCompID("XXXX"))
fix.Session.sendToTarget(message, sessionID)
Any ideas how I can send the order message, without duplicating the message sequence number and hence breaking the connection?

(Step Functions Activity Worker) Best practice for handling long polling timeouts in boto?

I am working on my first Step Functions Activity Worker (EC2). Predictably, after 5 minutes of long polling with no activity from the Step Functions state machine, the client connection times out with the error:
botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "https://states.us-east-1.amazonaws.com/"
Would it be better to catch the error and retry the long poll (every 5 minutes when no activity is present), or try to terminate the call early and retry before the error? I've thought about using a different type of loop, but I want to maximize the value of long polling and not repeatedly request against the Step Functions API (although if that's the best way I'll do it).
Thank you,
Andrew
import boto3
import time
import json
region = 'us-east-1'
activity_arn = 'arn:aws:states:us-east-1:754112345676:activity:Process_Imagery'
while True:
client = boto3.client('stepfunctions', region_name=region)
response = client.get_activity_task(activityArn=activity_arn,
workerName='imagery_processor')
activity_token = response['taskToken']
input_params = json.loads(response['input'])
print("================")
print(input_params)
client.send_task_success(taskToken=activity_token, output='true')

I believe I answered my own question here. The AWS documentation states:
"The maximum time the service holds on to the request before responding is 60 seconds. If no task is available within 60 seconds, the poll returns a taskToken with a null string."
However, instead of string being returned, I believe the JSON response from StepFunctions has no 'taskToken' at all. This while loop works:
import boto3
import time
import json
from botocore.config import Config as BotoCoreConfig
region = 'us-east-1'
boto_config = BotoCoreConfig(read_timeout=70, region_name=region)
sf_client = boto3.client('stepfunctions', config=boto_config)
activity_arn = 'arn:aws:states:us-east-1:754185699999:activity:Process_Imagery'
while True:
response = sf_client.get_activity_task(activityArn=activity_arn,
workerName='imagery_processor')
if 'taskToken' not in response:
print('No Task Token')
# time.sleep(2)
else:
print(response['taskToken'])
print("===================")
activity_token = response['taskToken']
sf_client.send_task_success(taskToken=activity_token, output='true')

Django: Cleaning up redis connection after client disconnects from stream

I've implemented a Server Sent Event API in my Django app to stream realtime updates from my backend to the browser. The backend is a Redis pubsub. My Django view looks like this:
def event_stream(request):
"""
Stream worker events out to browser.
"""
listener = events.Listener(
settings.EVENTS_PUBSUB_URL,
channels=[settings.EVENTS_PUBSUB_CHANNEL],
buffer_key=settings.EVENTS_BUFFER_KEY,
last_event_id=request.META.get('HTTP_LAST_EVENT_ID')
)
return http.HttpResponse(listener, mimetype='text/event-stream')
And the events.Listener class that I'm returning as an iterator looks like this:
class Listener(object):
def __init__(self, rcon_or_url, channels, buffer_key=None,
last_event_id=None):
if isinstance(rcon_or_url, redis.StrictRedis):
self.rcon = rcon_or_url
elif isinstance(rcon_or_url, basestring):
self.rcon = redis.StrictRedis(**utils.parse_redis_url(rcon_or_url))
self.channels = channels
self.buffer_key = buffer_key
self.last_event_id = last_event_id
self.pubsub = self.rcon.pubsub()
self.pubsub.subscribe(channels)
def __iter__(self):
# If we've been initted with a buffer key, then get all the events off
# that and spew them out before blocking on the pubsub.
if self.buffer_key:
buffered_events = self.rcon.lrange(self.buffer_key, 0, -1)
# check whether msg with last_event_id is still in buffer. If so,
# trim buffered_events to have only newer messages.
if self.last_event_id:
# Note that we're looping through most recent messages first,
# here
counter = 0
for msg in buffered_events:
if (json.loads(msg)['id'] == self.last_event_id):
break
counter += 1
buffered_events = buffered_events[:counter]
for msg in reversed(list(buffered_events)):
# Stream out oldest messages first
yield to_sse({'data': msg})
try:
for msg in self.pubsub.listen():
if msg['type'] == 'message':
yield to_sse(msg)
finally:
logging.info('Closing pubsub')
self.pubsub.close()
self.rcon.connection_pool.disconnect()
I'm able to successfully stream events out to the browser with this setup. However, it seems that the disconnect calls in the listener's "finally" don't ever actually get called. I assume that they're still camped out waiting for messages to come from the pubsub. As clients disconnect and reconnect, I can see the number of connections to my Redis instance climbing and never going down. Once it gets to around 1000, Redis starts freaking out and consuming all the available CPU.
I would like to be able to detect when the client is no longer listening and close the Redis connection(s) at that time.
Things I've tried or thought about:
A connection pool. But as the redis-py README states, "It is not safe to pass PubSub or Pipeline objects between threads."
A middleware to handle the connections, or maybe just disconnections. This won't work because a middleware's process_response() method gets called too early (before http headers are even sent to the client). I need something called when the client disconnects while I'm in the middle of streaming content to them.
The request_finished and got_request_exception signals. The first, like process_response() in a middleware, seems to fire too soon. The second doesn't get called when a client disconnects mid-stream.
Final wrinkle: In production I'm using Gevent so I can get away with keeping a lot of connections open at once. However, this connection leak issue occurs whether I'm using plain old 'manage.py runserver', or Gevent monkeypatched runserver, or Gunicorn's gevent workers.

UPDATE: As of Django 1.5, you'll need to return a StreamingHttpResponse instance if you want to lazily stream things out as I'm doing in this question/answer.
ORIGINAL ANSWER BELOW
After a lot of banging on things and reading framework code, I've found what I think is the right answer to this question.
According to the WSGI PEP, if your application returns an iterator with a close() method, it should be called by the WSGI server once the response has finished. Django supports this too. That's a natural place to do the Redis connection cleanup that I need.
There's a bug in Python's wsgiref implementation, and by extension in Django's 'runserver', that causes close() to be skipped if the client disconnects from the server mid-stream. I've submitted a patch.
Even if the server honors close(), it won't be called until a write to the client actually fails. If your iterator is blocked waiting on the pubsub and not sending anything, close() won't be called. I've worked around this by sending a no-op message into the pubsub each time a client connects. That way when a browser does a normal reconnect, the now-defunct threads will try to write to their closed connections, throw an exception, then get cleaned up when the server calls close(). The SSE spec says that any line beginning with a colon is a comment that should be ignored, so I'm just sending ":\n" as my no-op message to flush out stale clients.
Here's the new code. First the Django view:
def event_stream(request):
"""
Stream worker events out to browser.
"""
return events.SSEResponse(
settings.EVENTS_PUBSUB_URL,
channels=[settings.EVENTS_PUBSUB_CHANNEL],
buffer_key=settings.EVENTS_BUFFER_KEY,
last_event_id=request.META.get('HTTP_LAST_EVENT_ID')
)
And the Listener class that does the work, along with a helper function to format the SSEs and an HTTPResponse subclass that lets the view be a little cleaner:
class Listener(object):
def __init__(self,
rcon_or_url=settings.EVENTS_PUBSUB_URL,
channels=None,
buffer_key=settings.EVENTS_BUFFER_KEY,
last_event_id=None):
if isinstance(rcon_or_url, redis.StrictRedis):
self.rcon = rcon_or_url
elif isinstance(rcon_or_url, basestring):
self.rcon = redis.StrictRedis(**utils.parse_redis_url(rcon_or_url))
if channels is None:
channels = [settings.EVENTS_PUBSUB_CHANNEL]
self.channels = channels
self.buffer_key = buffer_key
self.last_event_id = last_event_id
self.pubsub = self.rcon.pubsub()
self.pubsub.subscribe(channels)
# Send a superfluous message down the pubsub to flush out stale
# connections.
for channel in self.channels:
# Use buffer_key=None since these pings never need to be remembered
# and replayed.
sender = Sender(self.rcon, channel, None)
sender.publish('_flush', tags=['hidden'])
def __iter__(self):
# If we've been initted with a buffer key, then get all the events off
# that and spew them out before blocking on the pubsub.
if self.buffer_key:
buffered_events = self.rcon.lrange(self.buffer_key, 0, -1)
# check whether msg with last_event_id is still in buffer. If so,
# trim buffered_events to have only newer messages.
if self.last_event_id:
# Note that we're looping through most recent messages first,
# here
counter = 0
for msg in buffered_events:
if (json.loads(msg)['id'] == self.last_event_id):
break
counter += 1
buffered_events = buffered_events[:counter]
for msg in reversed(list(buffered_events)):
# Stream out oldest messages first
yield to_sse({'data': msg})
for msg in self.pubsub.listen():
if msg['type'] == 'message':
yield to_sse(msg)
def close(self):
self.pubsub.close()
self.rcon.connection_pool.disconnect()
class SSEResponse(HttpResponse):
def __init__(self, rcon_or_url, channels, buffer_key=None,
last_event_id=None, *args, **kwargs):
self.listener = Listener(rcon_or_url, channels, buffer_key,
last_event_id)
super(SSEResponse, self).__init__(self.listener,
mimetype='text/event-stream',
*args, **kwargs)
def close(self):
"""
This will be called by the WSGI server at the end of the request, even
if the client disconnects midstream. Unless you're using Django's
runserver, in which case you should expect to see Redis connections
build up until http://bugs.python.org/issue16220 is fixed.
"""
self.listener.close()
def to_sse(msg):
"""
Given a Redis pubsub message that was published by a Sender (ie, has a JSON
body with time, message, title, tags, and id), return a properly-formatted
SSE string.
"""
data = json.loads(msg['data'])
# According to the SSE spec, lines beginning with a colon should be
# ignored. We can use that as a way to force zombie listeners to try
# pushing something down the socket and clean up their redis connections
# when they get an error.
# See http://dev.w3.org/html5/eventsource/#event-stream-interpretation
if data['message'] == '_flush':
return ":\n" # Administering colonic!
if 'id' in data:
out = "id: " + data['id'] + '\n'
else:
out = ''
if 'name' in data:
out += 'name: ' + data['name'] + '\n'
payload = json.dumps({
'time': data['time'],
'message': data['message'],
'tags': data['tags'],
'title': data['title'],
})
out += 'data: ' + payload + '\n\n'
return out

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.