Reading data consecutively in an AWS SQS queue - python

I am very new to AWS SQS queues and I am currently playing around with boto. I noticed that when I read a queue filled with messages in a while loop, the queue stops returning messages after 10-25 have been read (even though it holds more than 1000 messages). It starts returning another set of 10-25 messages after a few seconds, or when I stop and restart the program.
while True:
    read_queue()  # connection is already established with the desired queue
Any thoughts on this behaviour, or could someone point me in the right direction? Just reiterating that I am only a couple of days old to SQS!
Thanks

That's the way that SQS queues work by default (short polling). If you haven't changed any settings after setting up your queue, the default is to get messages from a weighted random sampling of machines. If you're using more than one machine and want all the messages you can consume at that moment (across all machines), you need to use long polling. See the Amazon documentation here. I don't think boto supports that directly ATM.

Long polling is more efficient because it allows you to leave the HTTP connection open for a period of time while you wait for more results. However, you can still do your own polling in boto by just setting up a loop and waiting for some period of time between reading the queue. You can still get good overall throughput with this polling strategy.
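A minimal sketch of the "do your own polling" loop described above. It assumes two hypothetical callables that are not part of boto: `read_batch`, which performs one read and returns a possibly empty list of messages, and `handle`, which processes one message. Because a short poll can come back empty while the queue still holds messages, the loop only stops after several consecutive empty reads.

```python
import time


def drain_queue(read_batch, handle, wait_seconds=1.0, max_empty_reads=5, sleep=time.sleep):
    """Keep reading until several consecutive reads come back empty.

    read_batch: hypothetical callable returning a (possibly empty) list of messages.
    handle:     hypothetical callable invoked once per message.
    """
    empty_reads = 0
    handled = 0
    while empty_reads < max_empty_reads:
        messages = read_batch()
        if not messages:
            # Short polling can return nothing even when the queue is not empty,
            # so pause and try again instead of giving up on the first miss.
            empty_reads += 1
            sleep(wait_seconds)
            continue
        empty_reads = 0
        for message in messages:
            handle(message)
            handled += 1
    return handled
```

The `wait_seconds` and `max_empty_reads` values are illustrative; tune them to how quickly the queue is expected to refill.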

Related

Best approach to tackle long polling in server side

I have a use case where I need to poll an API every second (basically an infinite while loop). The polling is initiated dynamically by a user through an external system, which means several polling loops can be running at the same time. A polling loop completes when the API returns 400. Anyway, my current implementation looks something like this:
Flask APP deployed on heroku.
Flask APP has an endpoint which external system calls to start polling.
That flask endpoint will add the message to queue and as soon as worker gets it, it will start polling. I am using Heroku Redis to Go addons. Under the hood it uses python-rq and redis.
The problem is when some polling process goes on for a long time, the other process just sits on the queue. I want to be able to do all of the polling in a concurrent process.
What's the best approach to tackle this problem? Fire up multiple workers?
What if there could be potentially more than 100 concurrent processes.
You could implement a "weighted"/priority queue. There may be multiple ways of implementing this, but the simplest example that comes to my mind is using a min or max heap.
You should keep track of how many events are in the queue for each process; as the number of events for one process grows, the weight of newly inserted events should decrease. Every time an event is processed, you start processing the next one with the greatest weight.
PS: More workers will also speed up the work.
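One possible reading of the weighted queue above, sketched with the standard library's `heapq`. The class name `FairQueue` and the weighting rule (the n-th queued event of a process gets priority n) are assumptions made for illustration, not something the answer specifies.

```python
import heapq
import itertools
from collections import defaultdict


class FairQueue:
    """Priority queue in which a process's events get lower priority as its backlog grows."""

    def __init__(self):
        self._heap = []
        self._pending = defaultdict(int)   # events currently queued per process
        self._order = itertools.count()    # tie-breaker keeps insertion order

    def push(self, process_id, event):
        # heapq is a min-heap, so a smaller number means a higher priority.
        # An event's priority is the number of events its process already has queued.
        priority = self._pending[process_id]
        self._pending[process_id] += 1
        heapq.heappush(self._heap, (priority, next(self._order), process_id, event))

    def pop(self):
        _, _, process_id, event = heapq.heappop(self._heap)
        self._pending[process_id] -= 1
        return process_id, event

    def __len__(self):
        return len(self._heap)
```

With this rule, a process that has queued many events cannot starve a process that has queued only one: the first event of each process is served before anyone's second event.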

Is it a bad practice to use sleep() in a web server in production?

I'm working with Django 1.8 and Python 2.7.
In a certain part of the project, I open a socket and send some data through it. Due to the way the other end works, I need to leave some time (let's say 10 milliseconds) between each piece of data that I send:
while True:
    send(data)
    sleep(0.01)
So my question is: is it considered bad practice to simply use sleep() to create that pause? Is there a more efficient approach?
UPDATED:
The reason I need that pause is that the other end of the socket is an external service that takes some time to process the chunks of data I send. I should also point out that it doesn't return anything after receiving the data, let alone after processing it. Leaving that brief pause ensures that each chunk of data I send gets properly processed by the receiver.
EDIT: changed the sleep to 0.01.
Yes, this is bad practice and an anti-pattern. You will tie up the "worker" which is processing this request for an unknown period of time, which will make it unavailable to serve other requests. The classic pattern for web applications is to service a request as-fast-as-possible, as there is generally a fixed or max number of concurrent workers. While this worker is continually sleeping, it's effectively out of the pool. If multiple requests hit this endpoint, multiple workers are tied up, so the rest of your application will experience a bottleneck. Beyond that, you also have potential issues with database locks or race conditions.
The standard approach to handling your situation is to use a task queue like Celery. Your web-application would tell Celery to initiate the task and then quickly finish with the request logic. Celery would then handle communicating with the 3rd party server. Django works with Celery exceptionally well, and there are many tutorials to help you with this.
If you need to provide information to the end-user, then you can generate a unique ID for the task and poll the result backend for an update by having the client refresh the URL every so often. (I think Celery will automatically generate a guid, but I usually specify one.)
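The hand-off pattern described above can be sketched without Celery. The example below swaps in a standard-library thread pool purely so it stays self-contained (and uses Python 3 rather than the asker's 2.7); a real task queue runs in separate worker processes and survives restarts, which this stand-in does not. The names `start_task`, `task_status` and `send_paced` are hypothetical.

```python
import time
import uuid
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a real task queue; a thread pool lives inside the web process,
# so unlike Celery it does not survive restarts or spread work across machines.
executor = ThreadPoolExecutor(max_workers=4)
tasks = {}  # task id -> Future


def send_paced(chunks, send, delay=0.01):
    """Send each chunk with a pause in between; runs off the request thread."""
    for chunk in chunks:
        send(chunk)
        time.sleep(delay)
    return len(chunks)


def start_task(chunks, send):
    """What the view would call: schedule the work and return an ID immediately."""
    task_id = str(uuid.uuid4())
    tasks[task_id] = executor.submit(send_paced, chunks, send)
    return task_id


def task_status(task_id):
    """What a polling endpoint would call to report progress."""
    future = tasks[task_id]
    return "done" if future.done() else "pending"
```

The view returns the task ID right away, and the client polls a status endpoint with that ID, so the request worker is never the one sleeping.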
Like most things, short answer: it depends.
Slightly longer answer:
If you're running it in an environment where you have many (50+ for example) connections to the webserver, all of which are triggering the sleep code, you're really not going to like the behavior. I would strongly recommend looking at using something like celery/rabbitmq so Django can dump the time delayed part onto something else and then quickly respond with a "task started" message.
If this is production, but you're the only person hitting the webserver, it still isn't great design, but if it works, it's going to be hard to justify the extra complexity of the task queue approach mentioned above.

Design for implementing web hooks (including not blocking and ignoring superseding repeat events)

I'm implementing a webhooks provider and trying to solve some problems while minimizing the added complexity to my system:
Not blocking processing of the API call that triggered the event while calling all the hooks so the response to that call will not be delayed
Not making a flood of calls to my listeners if some client is quickly calling my APIs that trigger hooks (i.e. wait a couple seconds and throw away any earlier calls if duplicates come in later)
My environment is Python (Chalice) and AWS Lambda. Ideal solution will be easy to integrate and cheap.
I would use SQS / SNS depending on the exact architecture design. Maybe Apache Kafka, if you need to store events longer...
So incoming events would be placed on SQS, and another Lambda would do the processing. The problem is that processing time is limited to 5 min. Also, delivery can't be done in parallel.
Another option is to have one input queue and one output queue per receiver. The Lambda function that processes the input just spreads it across the other queues, and other Lambdas are responsible for delivery. That approach has other obvious problems.
Finally, your Lambda, while processing input, can generate messages on an outgoing queue, indicating which message should be delivered to which users. Then you can have one Lambda triggered on each message from the outgoing queue, with a small loop that delivers the messages. Note that in case of problems you need to send back whatever was not delivered.
A good point is that SQS has a dead letter queue, so problematic messages do not stay there forever.
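A rough sketch of the last variant, with the queue plumbing left out. The function names and the message shape (`url`, `payload`) are assumptions made for illustration; `post` stands for whatever HTTP call the delivery Lambda would make and is expected to raise on failure.

```python
def fan_out(event, subscribers):
    """Turn one incoming event into one outgoing message per subscriber URL."""
    return [{"url": url, "payload": event} for url in subscribers]


def deliver(messages, post):
    """Try to deliver each message; return the ones that failed.

    post: hypothetical callable(url, payload) that raises on failure.
    """
    undelivered = []
    for message in messages:
        try:
            post(message["url"], message["payload"])
        except Exception:
            # Leave the decision to the caller: re-queue the message, or let
            # the queue's redrive policy move it to a dead letter queue.
            undelivered.append(message)
    return undelivered
```

Returning the failures rather than retrying in place keeps the delivery loop short, which matters when the function has a hard time limit.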

Best practice for polling an AWS SQS queue and deleting received messages from queue?

I have an SQS queue that is constantly being populated by a data consumer and I am now trying to create the service that will pull this data from SQS using Python's boto.
The way I designed it is that I will have 10-20 threads all trying to read messages from the SQS queue and then doing what they have to do on the data (business logic), before going back to the queue to get the next batch of data once they're done. If there's no data they will just wait until some data is available.
I have two areas I'm not sure about with this design
Is it a matter of calling receive_message() with a long time_out value and if nothing is returned in the 20 seconds (maximum allowed) then just retry? Or is there a blocking method that returns only once data is available?
I noticed that once I receive a message, it is not deleted from the queue. Do I have to receive a message and then send another request to delete it from the queue? That seems like a bit of overkill.
Thanks
The long-polling capability of the receive_message() method is the most efficient way to poll SQS. If that returns without any messages, I would recommend a short delay before retrying, especially if you have multiple readers. You may want to even do an incremental delay so that each subsequent empty read waits a bit longer, just so you don't end up getting throttled by AWS.
And yes, you do have to delete the message after you have read it, or it will reappear in the queue. This can actually be very useful when a worker reads a message and then fails before it can fully process it. In that case, the message becomes visible again and is read by another worker. You also want to make sure the visibility timeout of the messages is long enough that the worker has time to process the message before it automatically reappears on the queue. If necessary, your workers can extend the timeout while they are processing if it is taking longer than expected.
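A sketch of that receive, process, delete cycle with an incremental delay on empty reads. It assumes a boto3-style SQS client is passed in as `sqs` (the question uses the older boto library, whose method names differ); `process`, `should_stop` and the delay values are hypothetical.

```python
import time


def poll_forever(sqs, queue_url, process, max_delay=30, sleep=time.sleep, should_stop=lambda: False):
    """Long-poll a queue, delete each message once processed, back off when idle.

    sqs:     assumed to be a boto3 SQS client, e.g. boto3.client("sqs").
    process: hypothetical callable that handles one message body.
    """
    delay = 0
    while not should_stop():
        response = sqs.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,  # SQS returns at most 10 messages per call
            WaitTimeSeconds=20,      # long polling; 20 seconds is the maximum
        )
        messages = response.get("Messages", [])
        if not messages:
            # Incremental delay: wait a little longer after each empty read.
            delay = min(max_delay, delay + 1)
            sleep(delay)
            continue
        delay = 0
        for message in messages:
            process(message["Body"])
            # Delete only after successful processing; if process() raises,
            # the message reappears once its visibility timeout expires.
            sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```

Each of the 10-20 reader threads could run its own copy of this loop, since SQS hides a received message from other readers until its visibility timeout expires.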
If you want a simple way to set up a listener that includes automatic deletion of messages when they're finished being processed, and automatic pushing of exceptions to a specified queue, you can use the pySqsListener package.
You can set up a listener like this:
from sqs_listener import SqsListener
class MyListener(SqsListener):
    # Called once per received message
    def handle_message(self, body, attributes, messages_attributes):
        run_my_function(body['param1'], body['param2'])
listener = MyListener('my-message-queue', 'my-error-queue')
listener.listen()
There is a flag to switch from short polling to long polling - it's all documented in the README file.
Disclaimer: I am the author of said package.
Another option is to set up a worker application using AWS Elastic Beanstalk, as described in this blogpost.
Instead of long polling using boto3, your Flask application receives the message as a JSON object in an HTTP POST. The HTTP path and the type of message being sent are configurable in the AWS Elastic Beanstalk Configuration tab.
AWS Elastic Beanstalk has the added benefit of being able to dynamically scale the number of workers as a function of the size of your SQS queue, along with its deployment management benefits.
This is an example application that I found useful as a template.

Processing High-Volume Streaming Data with Twisted or using Threads, Queue in Python

I am receiving tweets at an extremely fast rate from a long-lived connection to the Twitter API Streaming Server. I then do some heavy text processing and save the tweets in my database.
I am using PyCurl for the connection, with a callback function that takes care of the text processing and saving to the db. See my approach below, which is not working properly.
I am not familiar with network programming, so I would like to know:
How can I use Threads, Queue or the Twisted framework to solve this problem?
import pycurl
def process_tweet(data):
    # pycurl calls this with each chunk of data received from the stream
    # do some heavy text processing
    pass
def open_stream_connection():
    connect = pycurl.Curl()
    connect.setopt(pycurl.URL, STREAMURL)
    connect.setopt(pycurl.WRITEFUNCTION, process_tweet)
    connect.setopt(pycurl.USERPWD, "%s:%s" % (TWITTER_USER, TWITTER_PASS))
    connect.perform()
You should have a number of threads receiving the messages as they come in. That number should probably be 1 if you are using pycurl, but should be higher if you are using httplib - the idea being you want to be able to have more than one query on the Twitter API at a time, so there is a steady amount of work to process.
When each Tweet arrives, it is pushed onto a Queue.Queue. The Queue ensures that there is thread-safety in the communications - each tweet will only be handled by one worker thread.
A pool of worker threads is responsible for reading from the Queue and dealing with the Tweet. Only the interesting tweets should be added to the database.
As the database is probably the bottleneck, there is a limit to the number of threads in the pool that are worth adding - more threads won't make it process faster, it'll just mean more threads are waiting in the queue to access the database.
This is a fairly common Python idiom. This architecture will scale only to a certain degree - i.e. what one machine can process.
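A minimal sketch of the queue plus worker pool idiom described above, written for Python 3, where the module is `queue` rather than `Queue`. The function name `run_pipeline`, the `process` callable and the `None` sentinel used to stop workers are illustrative choices, not part of the answer.

```python
import queue
import threading


def run_pipeline(tweets, process, num_workers=4):
    """Push items onto a queue and let a pool of worker threads process them."""
    work = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            tweet = work.get()
            if tweet is None:          # sentinel: no more work for this worker
                work.task_done()
                break
            outcome = process(tweet)
            with results_lock:
                results.append(outcome)
            work.task_done()

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for tweet in tweets:               # in practice the stream callback does this put()
        work.put(tweet)
    for _ in threads:
        work.put(None)                 # one sentinel per worker
    for thread in threads:
        thread.join()
    return results
```

In the streaming case the receiving callback would only call `work.put(...)` and return, so the connection is never held up by the text processing or the database write.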
Here's a simple setup if you are OK with using a single machine.
1 thread accepts connections. After a connection is accepted, it passes the accepted connection to another thread for processing.
You can, of course, use processes (e.g., using multiprocessing) instead of threads, but I'm not familiar enough with multiprocessing to give advice. The setup would be the same: 1 process accepts connections, then passes them to subprocesses.
If you need to shard the processing across multiple machines, then the simple thing to do would be to stuff the message into the database, then notify the workers about the new record (this will require some sort of coordination/locking between the workers). If you want to avoid hitting the database, then you'll have to pipe messages from your network process to the workers (and I'm not well versed enough in low level networking to tell you how to do that :))
I suggest this organization:
one process reads Twitter, stuffs tweets into database
one or more processes reads database, processes each, inserts into new database. Original tweets either deleted or marked processed.
That is, you have two or more processes/threads. The tweet database can be seen as a queue of work. Multiple worker processes take jobs (tweets) off the queue and create data in the second database.
