What's the preferred method for throttling websocket connections? - python

I have a web app where I am streaming model changes to a backbone collection in a Chrome client. There are a few backbone views that may or may not render parts of the page, depending on the type of update and what is being looked at. For example, some changes to a model result in the view for the collection being re-rendered, and there may or may not be a detail panel view open for the model that's being updated. These model changes can happen very fast, as the server-side workflow involves quite verbose and rapid changes to the model.
Here's the problem: I'm getting a large number of broken pipe (errno 32) errors in the web server's process when sending messages to the client, even though the websocket connection is still up and its readyState is still 1 (OPEN).
What I suspect is happening is that the various views haven't finished rendering in the onmessage callback by the time the next message comes in. After I get these tracebacks in stdout, the websocket connection still works and the UI still updates.
If I put eventlet.sleep(0.02) in the loop that reads model changes off the message queue and sends them on the websocket, the broken pipe messages go away; however, this isn't a real solution and feels like a nasty hack.
Has anyone had similar problems with a websocket's onmessage handler trying to do too much work and still being busy when the next message comes in? Does anyone have a solution?

I think the most efficient way to do this is to have the client app tell the server what it is displaying. The server keeps track of this and sends changes only for the objects currently being viewed, and only to the clients concerned.
One way to do this is with a "who watches what" list of items.
Items are indexed in two ways: by client ID, and with an isViewedBy linked list inside each data object (I know it doesn't look clean to mix this with the data, but it is very efficient).
You'll also need a lastupdate timestamp for each data object.
When a client changes view, it sends an "I'm viewing this, and I have version -timestamp-" message to the server. The server checks the timestamp and sends the object back if required. It also removes obsolete "who watches what" items (accessing them by client ID) and creates the new ones.
When a data object is updated, loop through that object's isViewedBy list to find which clients should be updated. Put the updates into per-client message buffers and flush those buffers manually (so that if you update several items at the same time, it sends one big message).
This is a lot of work, but your app will be efficient and scale gracefully, even with lots of objects and lots of clients. It sends only useful messages, and it is very unlikely that there will be too many of them.
For your onmessage problem, I would store the incoming data in a queue and process it asynchronously; a rough sketch of the buffering idea is below.
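As an illustration, here is a minimal Python sketch of that registry with per-client buffers (all names, such as WatchRegistry, set_view and the send() call on the socket objects, are made up for illustration and not tied to any particular framework):

import json
import time
from collections import defaultdict

class WatchRegistry(object):
    """Tracks which client is viewing which objects and buffers updates per client."""

    def __init__(self):
        self.viewed_by_client = defaultdict(set)  # client_id -> set of object ids
        self.viewers_of = defaultdict(set)        # object id -> set of client ids
        self.buffers = defaultdict(list)          # client_id -> pending updates

    def set_view(self, client_id, object_ids):
        """Client reports what it is now viewing; drop its old watch entries."""
        for obj_id in self.viewed_by_client[client_id]:
            self.viewers_of[obj_id].discard(client_id)
        self.viewed_by_client[client_id] = set(object_ids)
        for obj_id in object_ids:
            self.viewers_of[obj_id].add(client_id)

    def object_updated(self, obj_id, payload):
        """Buffer the change for every client currently viewing this object."""
        for client_id in self.viewers_of[obj_id]:
            self.buffers[client_id].append(
                {'id': obj_id, 'data': payload, 'ts': time.time()})

    def flush(self, sockets):
        """Send each client one combined message instead of many small ones."""
        for client_id, updates in self.buffers.items():
            if updates and client_id in sockets:
                sockets[client_id].send(json.dumps(updates))
        self.buffers.clear()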

gRPC streaming proxy | Architectural decision

I'm building a microservices app that is supposed to launch tests in separate containers. I also want to stream log messages to the frontend and store them in a database, and I don't know which solution would be better in that case:
Use my backend as a proxy to store messages and redirect them to the frontend
Stream messages to both the backend and the frontend to avoid redirection
The frontend could start observing the test at any time; in the first case I could just prepend the messages I would read from the database, but in the second case I will need to handle concurrency issues in some way.
I'm stuck at writing the proto files, so there is nothing much to post here. I've just figured out how bulky the backend service is going to be if I pick the first option, since I will need to duplicate TestRunner's calls there.
Please also let me know if you see other issues I'm going to face with either of those options. Thanks in advance!

With imap_tools (or imaplib), how do I sync imap changes instead of polling by repeatedly fetching the entire imap database?

Since there are several similar sounding questions around I want to be very precise.
Edit: Let's focus specifically on reacting dynamically to any email message being moved from one folder to another.
A typical imap client app fetches only changes in the imap database since last sync. If your email client had to fetch every email each time you run it, that would take a long time.
Unfortunately my imap_tools app has to fetch (headers only) the entire imap database every time I run it. In order to detect changes dynamically, I would have to poll the entire set of messages repeatedly. Obviously, this is not a reasonable design.
Does imap_tools (or the underlying imaplib) provide a mechanism for syncing?
Using the "seen" flag is not it. That is for indicating whether a human has read the message, and it is also not specific to a particular client.
Relying on uid is not quite it either, because I want to detect if the user has deleted a message or moved it from one folder to another.
IMAP, at its core, is an old and not terribly efficient protocol, as the design was not focused on syncing. Kundrát calls it a cache-filling protocol: the server is the one source of truth, and it is the client's job to display this to the user, and usually to cache as much of it as possible.
In baseline IMAP, this generally means connecting to the server and interrogating and caching as much information as the client cares to show: number of messages, headers, flags, possibly bodies, maybe attachments.
It also assumes the client has a mostly stable network connection while it is in use, which was true of most desktop-mode clients. Once you have all your data synced, the server can send you unsolicited responses: EXISTS when a new message comes in, FETCH (with updated FLAGS) when flags are changed, and EXPUNGE when a message is deleted. A server will not normally send these except in response to a permitted user command. Older clients often used NOOP, or perhaps CHECK, for this.
If you lose your connection, clients will reconnect and refresh their cache. Since the only mutable things about messages are their existence and flags, this is usually fairly quick. The client will typically request all the flags for all messages, and from there it can quickly update its cache: apply the flags, fetch headers for any new UIDs it discovered, and remove the cached versions of UIDs it didn't receive.
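As a rough illustration of that resync step, here is a minimal sketch using Python's standard imaplib (the host, credentials, folder and the shape of the cache dict are placeholders; real code would need more careful response parsing and error handling):

import imaplib
import re

def resync_folder(host, user, password, folder, cache):
    """cache: dict uid -> {'flags': ..., 'headers': ...} kept from the previous run."""
    imap = imaplib.IMAP4_SSL(host)
    imap.login(user, password)
    imap.select(folder, readonly=True)

    # One round trip: UID and FLAGS for every message currently in the folder.
    status, data = imap.uid('FETCH', '1:*', '(FLAGS)')
    seen_uids = set()
    for line in data:
        if not line:
            continue
        text = line.decode() if isinstance(line, bytes) else str(line)
        m_uid = re.search(r'UID (\d+)', text)
        m_flags = re.search(r'FLAGS \(([^)]*)\)', text)
        if not (m_uid and m_flags):
            continue
        uid, flags = m_uid.group(1), m_flags.group(1)
        seen_uids.add(uid)
        if uid in cache:
            cache[uid]['flags'] = flags  # known message: just refresh its flags
        else:
            # New message: fetch only its headers.
            _, hdr = imap.uid('FETCH', uid, '(BODY.PEEK[HEADER])')
            cache[uid] = {'flags': flags, 'headers': hdr[0][1]}

    # Anything cached that the server no longer reports was expunged (or moved away).
    for uid in set(cache) - seen_uids:
        del cache[uid]

    imap.logout()
    return cache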
This does start to break down when a folder has many tens of thousands of messages; you will find clients start to have very slow startup/syncing speeds on some servers at this point, and start to use rather a lot of data.
IMAP as a protocol cannot track messages across folders. The state per folder is completely separate. If a message is moved, it is equivalent to a removal from one folder and an add to another. Desktop clients often maintain a pool of connections to watch more than one folder at a time. You could apply heuristics to your cached messages to try to detect folder moves (e.g. by matching a selection of headers and metadata), but it can't be perfect.
As you can see, a lot of this is terribly inefficient once your mailbox grows past a few hundred messages, so there are a lot of extensions to make caching more efficient.
UIDPLUS (RFC 4315) is almost everywhere. It requires the server to support UIDs in more commands, and it is almost required for any cache-mode client, as message sequence numbers are unreliable when deletions are involved.
IDLE (RFC 2177) is fairly common, but not everywhere. The client can issue an IDLE command, and this tells the server it's ready for those unsolicited updates at any time. This means the client doesn't have to poll every few minutes with the NOOP command.
CONDSTORE (RFC 4551) is on most unix-type servers and some commercial servers. Among other things, it associates a serial number with flag changes. This allows the flag-resync step to fetch only the changes since the most recent serial number the client knows about. It does not, however, help with detecting deleted messages, and a UID SEARCH ALL would still be necessary to find those after a disconnection.
QRESYNC (RFC 5162) provides resynchronization data for deleted messages. Unfortunately this is quite a rare extension, and it is almost nonexistent on large commercial servers.
NOTIFY (RFC 5465) is available almost nowhere. It's supposed to be like a super-IDLE that can monitor multiple mailboxes at the same time.
The Gmail extensions are of course Gmail-specific. Among other things, they associate a permanent identifier with each message (X-GM-MSGID), which DOES allow it to be reliably tracked across folders. They also provide the "All Mail" folder and labels, which means you can sync the whole account by syncing just the All Mail folder. As with other servers, this does start to get bandwidth-inefficient once you hit tens of thousands of messages.
From my experience participating in the development of several mobile email clients that emphasized bandwidth efficiency and responsiveness, a client can appear very responsive even while dealing with all the problems of IMAP. IDLE can be used to try to keep the INBOX in sync. If you can't do that, you can hide a lot of jank by keeping only the most recent week's messages in total sync and syncing the rest less frequently (UID SEARCH SINCE is helpful here). The user is usually only looking at the end of their inbox, and generally only cares about new messages coming in.
And in general, what looks like mirroring the move of a message is actually just detected as a delete and an add; it's just that internet connections and servers are fast enough that something taking a couple of hundred milliseconds can look instant to a user. If any optimization is occurring, it's heuristic. I think Thunderbird has a protocol log you can turn on; if you're really curious what it's doing, turn it on, move a message, and see what it does.
You can:
Use search args to limit the data set: date_gte, date_lt, new ...
Rely on the Message-ID header if you store something
Use mailbox.move to reliably "mark" a message instead of using flags
Calculate a message hash
It all depends on your task. A small sketch of the first and third ideas with imap_tools is below.
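As a quick, hedged sketch of those two ideas with imap_tools (server, credentials and the 'Processed' folder are placeholders; check the exact keyword arguments against the imap_tools version you use):

import datetime as dt
from imap_tools import MailBox, AND

with MailBox('imap.example.com').login('user@example.com', 'password', 'INBOX') as mailbox:
    # Limit the data set with search args instead of fetching everything:
    # only headers of messages received in the last 7 days.
    since = dt.date.today() - dt.timedelta(days=7)
    recent = list(mailbox.fetch(AND(date_gte=since), headers_only=True, mark_seen=False))

    # Use a folder move as a durable "already processed" marker instead of a flag.
    processed_uids = [msg.uid for msg in recent if msg.subject.startswith('[ticket]')]
    if processed_uids:
        mailbox.move(processed_uids, 'Processed')  # assumes the 'Processed' folder exists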
As far as I know, there is no "sync" in IMAP; there is IDLE.
Since 0.51.0, imap_tools has IDLE support:
https://github.com/ikvk/imap_tools/releases/tag/v0.51.0
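For illustration, a minimal IDLE loop with imap_tools (server and credentials are placeholders, and the idle.wait method name follows the 0.51.x release notes; treat the exact API as an assumption and check the documentation for your version):

from imap_tools import MailBox, AND

with MailBox('imap.example.com').login('user@example.com', 'password', 'INBOX') as mailbox:
    while True:
        # Block until the server pushes something (or the timeout expires),
        # instead of repeatedly re-fetching the whole folder.
        responses = mailbox.idle.wait(timeout=60)
        if responses:
            for msg in mailbox.fetch(AND(seen=False), mark_seen=False):
                print(msg.date, msg.from_, msg.subject)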

reload image/page after computation is complete from the server side

I have Python code that performs some fairly intense computations and then generates a plot (png file) for display on the web. I'm using Python, Flask, mod_wsgi, and Apache. Since the computation takes several seconds (around 10 seconds), I'd like to display a "LOADING" type image while the computation is happening, so the viewer doesn't think the server is messed up, and then the actual image when the computation is complete. How can I do this from the server side (not using JavaScript in the web browser)? In the past I remember seeing a lot of web pages where it seemed like the server was pushing a new page to the browser (from what I recall, mostly search engines and message forums). I believe this is really an HTTP question, so it doesn't necessarily have to be specific to serving an image (it could be an HTML page) or to using Python, Flask, mod_wsgi, or Apache, but it would be great if an example could be given for just that configuration.
Before JavaScript, I implemented this by generating a page that had a refresh in the HTML header, with a delay of 2-3 seconds.
This page would redisplay itself until the code generating it noticed that the result was finished, and then generate different HTML code (without the refresh):
<HEAD>
<META http-equiv="refresh" content="3">
</HEAD>
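As a minimal Flask sketch of that meta-refresh pattern (the route names, the in-memory RESULTS dict and generate_plot are hypothetical; with multiple Apache/mod_wsgi processes you would keep the job state somewhere shared, such as a file or database, instead of a module-level dict):

import threading
import time
import uuid
from flask import Flask, redirect, url_for

app = Flask(__name__)
RESULTS = {}  # job_id -> URL of the finished png (in-memory bookkeeping, sketch only)

def generate_plot(job_id):
    time.sleep(10)  # placeholder for the ~10 s computation that writes static/<job_id>.png
    RESULTS[job_id] = '/static/%s.png' % job_id

@app.route('/start')
def start():
    job_id = uuid.uuid4().hex
    threading.Thread(target=generate_plot, args=(job_id,)).start()
    return redirect(url_for('status', job_id=job_id))

@app.route('/status/<job_id>')
def status(job_id):
    if job_id in RESULTS:
        # Done: serve the page with the finished image, no refresh header.
        return '<html><body><img src="%s"></body></html>' % RESULTS[job_id]
    # Not done yet: the meta refresh makes the browser re-request this page in 3 s.
    return ('<html><head><meta http-equiv="refresh" content="3"></head>'
            '<body>LOADING...</body></html>')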
I'm aware that this question is a bit old now, but I feel there is not enough information available on this topic. I struggled to find relevant information myself, so I hope this will help:
Suggested solution
There are different technologies that can be used here, but I believe the simplest would be Server Sent Events. The implementation for Flask can be found here. The last part of the documentation is really important:
Subscribers will connect and block for a long time, so you should seriously consider running under an asynchronous WSGI server, such as gunicorn+gevent
So make sure to fulfil that requirement. Also, it's important to understand that this approach is good if you want to send messages from your server to the client. If you have an external worker that does the calculations for you, this method will only make things more complicated, since your server will have to play the role of a middleman between the browser and the worker machine. On some hosts it may not even work as expected (e.g. Heroku - I'm still not sure why it misbehaves; it looks like too many updates from the worker are not propagated properly to the client). If you use the same host for your app and the workers, you should have no problem though.
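For reference, a bare-bones Server-Sent Events endpoint in plain Flask, without any extension (the progress generator and the image path are placeholders; the asynchronous-server caveat quoted above still applies):

import json
import time
from flask import Flask, Response

app = Flask(__name__)

def computation_progress():
    """Placeholder generator that yields progress until the job is done."""
    for pct in range(0, 101, 10):
        time.sleep(1)  # stand-in for real work or polling a job record
        yield pct

@app.route('/stream')
def stream():
    def event_stream():
        for pct in computation_progress():
            # SSE frames are plain text: "data: ...\n\n"
            yield 'data: %s\n\n' % json.dumps({'progress': pct})
        yield 'data: %s\n\n' % json.dumps({'done': True, 'image': '/static/plot.png'})
    return Response(event_stream(), mimetype='text/event-stream')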
Alternate solution
In my opinion this type of calculation belongs in the background, so this solution assumes that you have some kind of worker doing the job for you (as I had when I first encountered the problem). Note that this solution is not server->client communication; it's based on polling. I think this may be the only option if you don't run on an asynchronous server in production.
So let's assume you have a worker whose status you can check, for example IronWorker. The user visits your page, and this kicks off the calculations on the worker. From this point on you should use AJAX calls to get status updates directly from your worker. In my app I used jQuery to poll the worker's web API and learn about its status. Once you discover that your worker is done, you can reload the page, or just the image, or whatever else you need; a minimal server-side status endpoint for this kind of polling is sketched below.
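If the status lives on your own server rather than behind the worker's API, a hedged sketch of the endpoint the browser would poll could look like this (check_worker is a placeholder for however you query the worker):

from flask import Flask, jsonify

app = Flask(__name__)

def check_worker(job_id):
    """Placeholder: ask the worker service / job record whether job_id has finished."""
    return {'done': False, 'image_url': None}

@app.route('/api/status/<job_id>')
def job_status(job_id):
    # The browser polls this endpoint every few seconds with AJAX and, once
    # done is true, swaps the LOADING placeholder for image_url.
    return jsonify(check_worker(job_id))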
Additional information
If you need to update many places at the same time (not only the browser), you can use queue services, for example IronMQ, which lets you push your messages to a queue and then subscribe to that queue with a client and receive the messages from it. This is what I did before I discovered I could query the worker directly for its status.

How does reddit send email?

I am trying to learn how large organisations that use Python structure their code, so that I can maybe apply some of their ideas to my own code.
Currently I am looking through reddit's code and am interested how they have implemented the sending of emails generated as part of the app's operations. See: https://github.com/reddit/reddit/blob/master/r2/r2/lib/emailer.py (their emailing library) and https://github.com/reddit/reddit/blob/master/r2/r2/models/mail_queue.py
I think mail_queue.py contains some form of SQLAlchemy-table-backed email queue.
Is this the case? Does that mean the table is kept in memory? Could somebody make this a little clearer for me?
Cheers from Down Under.
P.S. May I suggest that anybody trying to get a good understanding of how to structure Python apps does the same as I am doing. Reading and understanding other people's code has allowed me to structure and write noticeably better code. :) Open source stuff is great!
Traditionally, the mail queue on e-mail servers has been some sort of disk storage. The reason for this is to minimize the chance of mail getting lost. For example, the mail server would receive a message and not send a successful return code back to the sending mail client until the entire message had been written to disk via a synchronous write.
Yes, the reddit code is using a database as an email data store via SQLAlchemy.
As for the table being stored in memory, I wouldn't imagine that it would be. From reading the SQLAlchemy documentation, the Table object in SQLAlchemy is just a proxy to the underlying table in whatever database is backing the system. In general, you wouldn't want the table in memory, since you don't know how many messages the system will process, how big the e-mail messages are, or how many messages will need to be queued in case of a temporary mail-sending failure.
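To make the "the Table object is just a proxy" point concrete, here is a tiny sketch of a database-backed mail queue in SQLAlchemy (this is not reddit's actual schema; the column names and the SQLite URL are purely illustrative):

import datetime
from sqlalchemy import (Column, DateTime, Integer, MetaData, String, Table,
                        Text, create_engine)

engine = create_engine('sqlite:///mail_queue.db')  # rows live on disk, not in memory
metadata = MetaData()

# The Table object only *describes* the table; the data stays in the database.
email_queue = Table(
    'email_queue', metadata,
    Column('id', Integer, primary_key=True),
    Column('to_addr', String(255), nullable=False),
    Column('subject', String(255)),
    Column('body', Text),
    Column('queued_at', DateTime, default=datetime.datetime.utcnow),
)
metadata.create_all(engine)

with engine.begin() as conn:
    conn.execute(email_queue.insert().values(to_addr='user@example.com',
                                             subject='hi', body='hello'))
    # A sender process would periodically SELECT pending rows and delete them
    # only after the SMTP handoff succeeds.
    for row in conn.execute(email_queue.select()):
        print(row.to_addr, row.subject)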

Synchronize Memcache and Datastore on Google App Engine

I'm writing a chat application using Google App Engine. I would like chats to be logged. Unfortunately, the Google App Engine datastore only lets you write to it once per second. To get around this limitation, I was thinking of using a memcache to buffer writes. In order to ensure that no data is lost, I need to periodically push the data from the memcache into the data store.
Is there any way to schedule jobs like this on Google App Engine? Or am I going about this in entirely the wrong way?
I'm using the Python version of the API, so a Python solution would be preferred, but I know Java well enough that I could translate a Java solution into Python.
To get around the write/update limit of entity groups (note that entities without a parent are their own entity group), you could create a new entity for every chat message and give each one a property that references the chat it belongs to.
You'd then find all the chat messages that belong to a chat via a query. But this would be very inefficient, as you'd then need to run a query for every user for every new message.
So go with the above advice, but additionally do:
Look into backends. These are always-on instances where you could aggregate chat messages in memory (and immediately/periodically flush them to the datastore). When a user requests the latest chat messages, you already have them in memory and can serve them instantly (saving time and cost compared to using the Datastore). Note that backends are not 100% reliable (they might go down from time to time), so adjust chat-message flushing to the datastore accordingly.
Check out the Channel API. This will allow you to notify users when there is a new chat message. This way you avoid polling for new chat messages and keep the number of requests down.
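A rough sketch of the Channel API side of that, using the legacy Python App Engine API (the client-id scheme and message format are assumptions):

import json
from google.appengine.api import channel

def open_chat_channel(user_id, chat_id):
    """Called from the page handler: returns a token the browser uses to listen."""
    client_id = '%s/%s' % (chat_id, user_id)
    return channel.create_channel(client_id)

def broadcast_message(chat_id, user_ids, message_text):
    """Push a new chat message to every participant instead of having them poll."""
    payload = json.dumps({'chat': chat_id, 'text': message_text})
    for user_id in user_ids:
        channel.send_message('%s/%s' % (chat_id, user_id), payload)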
That sounds like the wrong way, since you are risking losing data in memcache.
You can write to one entity group once per second.
You can write to separate entity groups very rapidly, so it really depends on how you structure your data. For example, if you kept an entire chat in one entity, you could only write that chat once per second, and you'd be limited to 1MB.
If you write a separate entity per message in the chat, you can write very, very quickly, but you need to devise a way to pull all the messages together, in order, for the log.
Edit: I agree with Peter Knego that the cost of using one entity per message will get way too expensive. His backend suggestion is pretty good too, although if your app is popular, backends don't scale that well.
I was trying to avoid sharding, but I think it will be necessary. If you're not familiar with sharding, read up on this: https://developers.google.com/appengine/articles/sharding_counters
Sharding would be an intermediate between writing one entity for all messages in a conversation and writing one entity per message. You would randomly split the messages between a number of entities. For example, if you shard the messages across 3 entities, you can write roughly 3x/sec (I doubt most human conversations go much faster than that).
On fetching, you would need to grab the 3 entities and merge the messages in chronological order. This would save you a lot on cost, but you would need to write the code to do the merging.
One other benefit is that your conversation limit would now be 3MB instead of 1MB. A rough sketch of this sharded layout is below.
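A hedged sketch of that sharded layout with the legacy App Engine ndb library (the model, property and key-naming scheme are made up for illustration):

import random
import time
from google.appengine.ext import ndb

NUM_SHARDS = 3  # write throughput (and total size) scales with the shard count

class ChatShard(ndb.Model):
    # Each shard holds a slice of one conversation's messages as JSON.
    messages = ndb.JsonProperty()  # list of {'ts': ..., 'author': ..., 'text': ...}

def _shard_key(chat_id, index):
    return ndb.Key(ChatShard, '%s-%d' % (chat_id, index))

@ndb.transactional
def append_message(chat_id, author, text):
    """The write goes to one randomly chosen shard, so shards absorb writes in parallel."""
    key = _shard_key(chat_id, random.randrange(NUM_SHARDS))
    shard = key.get() or ChatShard(key=key, messages=[])
    shard.messages.append({'ts': time.time(), 'author': author, 'text': text})
    shard.put()

def read_chat(chat_id):
    """Fetch all shards and merge their messages back into chronological order."""
    shards = ndb.get_multi([_shard_key(chat_id, i) for i in range(NUM_SHARDS)])
    msgs = [m for s in shards if s for m in s.messages]
    return sorted(msgs, key=lambda m: m['ts'])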
Why not use a pull task? I highly recommend this Google video if you are not familiar enough with task queues; the first 15 minutes cover pull-queue information that may apply to your situation. Anything involving per-message updates may get quite expensive in terms of datastore ops, and this will be greatly exacerbated if you have any indices involved. Video link:
https://www.youtube.com/watch?v=AM0ZPO7-lcE&feature=player_embedded
I would simply set up my chat entity when users initiate the chat in the online handler, passing back the entity id to the chat parties. Send the id+message to your pull queue, and serialize the messages into the chat entity's TextProperty. You likely won't schedule the pull-queue processing more often than once per second, so that avoids your entity-update limitation. Most importantly, your database ops will be greatly reduced; a rough sketch of this flow is below.
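A rough sketch of that pull-queue flow with the legacy App Engine taskqueue API (the queue name 'chat-pull' and the handler wiring are assumptions; the queue must be declared as a pull queue in queue.yaml):

import json
from google.appengine.api import taskqueue
from google.appengine.ext import ndb

class Chat(ndb.Model):
    log = ndb.TextProperty(default='')  # all messages serialized into one property

def enqueue_message(chat_id, text):
    """Called from the request handler: cheap, no datastore write per message."""
    taskqueue.Queue('chat-pull').add(
        taskqueue.Task(payload=json.dumps({'chat': chat_id, 'text': text}),
                       method='PULL'))

def flush_messages():
    """Run periodically (cron or a resident loop): drain the queue, write each chat once."""
    queue = taskqueue.Queue('chat-pull')
    tasks = queue.lease_tasks(lease_seconds=30, max_tasks=100)
    if not tasks:
        return
    by_chat = {}
    for task in tasks:
        data = json.loads(task.payload)
        by_chat.setdefault(data['chat'], []).append(data['text'])
    for chat_id, texts in by_chat.items():
        chat = Chat.get_or_insert(chat_id)
        chat.log += '\n'.join(texts) + '\n'
        chat.put()
    queue.delete_tasks(tasks)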
I think you could create tasks that persist the data. This has the advantage that, unlike memcache, tasks are persisted, so no chats would be lost.
When a new chat message comes in, create a task to save the chat data, and do the persisting in the task handler. You could either configure the task queue to run at 1 task per second (or slightly slower) and save each bit of chat data held in the task, or persist the incoming chats in a temporary table (in different entity groups) and periodically have tasks pull all unsaved chats from the temporary table, persist them to the chat entity, and then remove them from the temporary table.
I think you would be fine using the chat session as the entity group and saving the chat messages in it.
The once-per-second limit is not really the reality; you can update/save at a higher rate. I'm doing it all the time and I don't have any problem with it.
Memcache is volatile and is the wrong choice for what you want to do. If you start encountering issues with the write rate, you can start setting up tasks to save the data.
