Synchronize Memcache and Datastore on Google App Engine - python

I'm writing a chat application using Google App Engine. I would like chats to be logged. Unfortunately, the Google App Engine datastore only lets you write to it once per second. To get around this limitation, I was thinking of using a memcache to buffer writes. In order to ensure that no data is lost, I need to periodically push the data from the memcache into the data store.
Is there any way to schedule jobs like this on Google App Engine? Or am I going about this in entirely the wrong way?
I'm using the Python version of the API, so a Python solution would be preferred, but I know Java well enough that I could translate a Java solution into Python.

To get around the write/update limit of entity groups (note that entities without a parent are their own entity group), you could create a new entity for every chat message and keep a property in it that references the chat it belongs to.
You'd then find all chat messages that belong to a chat via a query. But this would be very inefficient, as you'd then need to do a query for every user for every new message.
So go with the above advice, but additionally do:
Look into backends. These are always-on instances where you could aggregate chat messages in memory (and immediately/periodically flush them to the datastore). When a user requests the latest chat messages, you already have them in memory and can serve them instantly (saving time and cost compared to using the Datastore). Note that backends are not 100% reliable; they might go down from time to time - adjust chat message flushing to the datastore accordingly.
Check out the Channel API. This will allow you to notify users when there is a new chat message. This way you'd avoid polling for new chat messages and keep the number of requests down.
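A minimal ndb sketch of the one-entity-per-message layout (the model and property names are my own, not from the answer):

from google.appengine.ext import ndb

class ChatMessage(ndb.Model):
    # No parent key: each message is its own entity group, so writes don't contend.
    chat_id = ndb.StringProperty(required=True)  # references the chat this message belongs to
    author = ndb.StringProperty()
    text = ndb.TextProperty()
    created = ndb.DateTimeProperty(auto_now_add=True)

def latest_messages(chat_id, limit=50):
    # One query pulls a chat's messages back together, newest first.
    return (ChatMessage.query(ChatMessage.chat_id == chat_id)
            .order(-ChatMessage.created)
            .fetch(limit))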

Sounds like the wrong way since you are risking losing data on memcache.
You can write to one entity group once per second.
You can write to separate entity groups very rapidly, so it really depends on how you structure your data. For example, if you kept an entire chat in one entity, you could only write that chat once per second, and you'd be limited to 1MB.
If you write a separate entity per message in the chat, you can write very quickly, but you need to devise a way to pull all the messages back together, in order, for the log.
Edit: I agree with Peter Knego that the cost of using one entity per message would get way too expensive. His backend suggestion is pretty good too, although if your app is popular, backends don't scale that well.
I was trying to avoid sharding, but I think it will be necessary. If you're not familiar with sharding, read up on this: https://developers.google.com/appengine/articles/sharding_counters
Sharding would be an intermediate between writing one entity for all messages in a conversation and one entity per message. You would randomly split the messages between a number of entities. For example, if you spread the messages across 3 entities, you can write 3x/sec (I doubt most human conversations would go any faster than that).
On fetching, you would need to grab the 3 entities, and merge the messages in chronological order. This would save you a lot on cost. But you would need to write the code to do the merging.
One other benefit is that your conversation limit would now be 3MB instead of 1MB.
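A rough sketch of the sharded layout (the shard count, names, and JSON message format are assumptions, not from the answer):

import random
from google.appengine.ext import ndb

NUM_SHARDS = 3

class ChatShard(ndb.Model):
    # Key id is "<chat_id>-<shard_number>"; each shard is its own entity group.
    messages = ndb.JsonProperty(repeated=True)  # each item: {"ts": ..., "text": ...}

def append_message(chat_id, message):
    shard_id = '%s-%d' % (chat_id, random.randint(0, NUM_SHARDS - 1))
    def txn():
        shard = ChatShard.get_by_id(shard_id) or ChatShard(id=shard_id)
        shard.messages.append(message)
        shard.put()
    ndb.transaction(txn)  # one entity group per shard, so up to NUM_SHARDS writes/sec

def all_messages(chat_id):
    keys = [ndb.Key(ChatShard, '%s-%d' % (chat_id, n)) for n in range(NUM_SHARDS)]
    shards = [s for s in ndb.get_multi(keys) if s]
    merged = [m for s in shards for m in s.messages]
    return sorted(merged, key=lambda m: m['ts'])  # merge back in chronological order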

Why not use a pull task? I highly recommend this Google video if you are not familiar enough with task queues. The first 15 minutes cover pull queue info that may apply to your situation. Anything involving per-message updates may get quite expensive in terms of database ops, and this will be greatly exacerbated if you have any indices involved. Video link:
https://www.youtube.com/watch?v=AM0ZPO7-lcE&feature=player_embedded
I would simply set up my chat entity when users initiate it in the online handler, passing back the entity id to the chat parties. Send the id+message to your pull queue, and serialize the messages within the chat entity's TextProperty. You won't likely schedule the pull queue cron more often than once per second, so that avoids your entity update limitation. Most importantly: your database ops will be greatly reduced.
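Roughly what that could look like, assuming a pull queue named chat-pull is defined in queue.yaml (the entity, queue name, and payload format are illustrative):

import json
from google.appengine.api import taskqueue
from google.appengine.ext import ndb

class Chat(ndb.Model):
    log = ndb.TextProperty(default='')  # serialized chat messages

def enqueue_message(chat_id, message):
    # Called from the chat handler: no datastore write happens here at all.
    taskqueue.Queue('chat-pull').add(taskqueue.Task(
        payload=json.dumps({'chat': chat_id, 'msg': message}), method='PULL'))

def drain_queue():
    # Called from a worker/cron handler, no more than about once per second per chat.
    q = taskqueue.Queue('chat-pull')
    tasks = q.lease_tasks(lease_seconds=60, max_tasks=100)
    by_chat = {}
    for t in tasks:
        data = json.loads(t.payload)
        by_chat.setdefault(data['chat'], []).append(data['msg'])
    for chat_id, msgs in by_chat.items():
        chat = Chat.get_by_id(chat_id) or Chat(id=chat_id)
        chat.log += '\n'.join(msgs) + '\n'
        chat.put()  # one write per chat regardless of how many messages arrived
    q.delete_tasks(tasks)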

I think you could create tasks that will persist the data. This has the advantage that, unlike memcache, the tasks are persisted, so no chats would be lost.
When a new chat comes in, create a task to save the chat data, and do the persist in the task handler. You could either configure the task queue to run at 1 per second (or slightly slower) and save each bit of chat data held in the task, or persist the incoming chats in a temporary table (in different entity groups) and periodically have a task pull all unsaved chats from the temporary table, persist them to the chat entity, and then remove them from the temporary table.
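A minimal sketch of the first option using the deferred library (persist_chat, append_to_chat_entity, and the queue name are made up for illustration; the queue would need to exist in queue.yaml):

from google.appengine.ext import deferred

def persist_chat(chat_id, message):
    # Runs inside a task; tasks are retried on failure, so the chat isn't lost.
    append_to_chat_entity(chat_id, message)  # hypothetical datastore write

# In the request handler that receives a new chat message:
deferred.defer(persist_chat, chat_id, message, _queue='chat-writes')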

I think you would be fine using the chat session as the entity group and saving the chat messages in it.
This once-per-second limit is not a hard reality; you can update/save at a higher rate. I'm doing it all the time and I don't have any problem with it.
Memcache is volatile and is the wrong choice for what you want to do. If you start encountering issues with the write rate, you can start setting up tasks to save the data.

Related

How to rule out duplicate messages in Pubsub without using Dataflow and without using ACK in Python?

I have a use case wherein I want to read the messages in Pub/Sub without acknowledging them. I would need help on how to rule out the possibility of "duplicate messages" which will remain in the Pub/Sub store when I don't ACK the delivered messages.
Solutions that I have thought of:
Store the pulled messages in Datastore and see if they are same.
Store the pulled messages at runtime and check if my message is a duplicate (O(n) time complexity and O(n) space complexity).
Store the pulled messages in a file and compare the new incoming messages from the messages in the file.
Use Dataflow and rule out the possibility (least expected)
I see that there is no offset-like feature in Pub/Sub similar to Kafka's, I think.
Which is the best approach that you would suggest in this matter, or is there any other alternative approach that I can use?
I am using python google-cloud-pubsub_v1 to create a python client and pulling messages from Pubsub.
I am sharing the code with the logic that pulls the data:
# project_id and subscription_name are assumed to be defined elsewhere
from google.cloud import pubsub_v1

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(
    project_id, subscription_name)
NUM_MESSAGES = 3
# The subscriber pulls a specific number of messages.
response = subscriber.pull(subscription_path, max_messages=NUM_MESSAGES)
for received_message in response.received_messages:
    print(received_message.message.data)
It sounds like Pub/Sub is probably not the right tool for the job. It seems as if you are trying to use Pub/Sub as a persistent data store, which is not the intended use case. Acking is a fundamental part of the lifecycle of a Cloud Pub/Sub message. Pub/Sub messages are deleted if they are unacked after the provided message retention period, which cannot be longer than 7 days.
I would suggest instead that you consider using an SQL database like Cloud Spanner. Then, you could generate a uuid for each message, use this as the primary key for deduplication, and transactionally update the database to ensure there are no duplicates.
I might be able to provide a better answer for you if you provide more information about what you are planning on doing with the deduplicated messages.
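A rough sketch of that suggestion with the Cloud Spanner Python client (the instance, database, table, and column names are assumptions; here the Pub/Sub message ID stands in for the suggested uuid primary key):

from google.api_core.exceptions import AlreadyExists
from google.cloud import spanner

spanner_client = spanner.Client()
database = spanner_client.instance('my-instance').database('my-db')  # assumed names

def store_if_new(received_message):
    msg = received_message.message
    def txn(transaction):
        # Inserting (not upserting) on the primary key fails for duplicates.
        transaction.insert(
            table='messages',
            columns=('message_id', 'payload'),
            values=[(msg.message_id, msg.data)])
    try:
        database.run_in_transaction(txn)
        return True   # first delivery of this message
    except AlreadyExists:
        return False  # duplicate delivery; skip it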

Handling Notifications to Users

I am building an application on GAE that needs to notify users when another user performs an action that affects them. A real world analogy would be being alerted when your friend comments on your facebook status.
I understand how the Channel API works to actually send notifications in real time, but I'm trying to understand the most effective way to store those notifications in the datastore. Ideally, I want the notification code to be decoupled from the actual event being performed. Is this a good use case for Prospective Search? It doesn't quite feel right since I don't need to perform any kind of searching, just: if you see a new comment, create a new notification that is stored in the datastore and pushed to the client through the channel api if they are connected. I basically need a database trigger, but I don't think GAE supports that.
Why don't you want to couple the event and its notifications in the first place?
I think it may be interesting to know in order to help you with your use case :)
If I had to do this, I would launch a task any time I write something to the datastore that might fire events...
That way you can do your write and have a separate "layer" to process the events.
Triggers would not even be that good of an option, since your application would have to "poll" the database to push events to the users' UI.
I think your process (firing events) does not belong in the database, since it might well need business rules that the datastore cannot provide: for example, when a user ignores another one, you should not fire events.
The more business logic you put in your database system, the more complex it gets to maintain & scale IMHO...
Looks like GAE does support mimicking database triggers using hooks.
Hooks can be useful for
query caching
auditing Datastore activity per-user
mimicking database triggers
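A minimal sketch of that hook approach with ndb (the model, task URL, and notification handler are made up):

from google.appengine.api import taskqueue
from google.appengine.ext import ndb

class Comment(ndb.Model):
    status_owner = ndb.StringProperty()
    text = ndb.TextProperty()

    def _post_put_hook(self, future):
        # Runs after every put(), mimicking an "after insert/update" trigger;
        # the actual notification work is decoupled into a task.
        taskqueue.add(url='/tasks/notify',
                      params={'comment': self.key.urlsafe()})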

What's the preferred method for throttle websocket connections?

I have a web app where I am streaming model changes to a backbone collection in a Chrome client. There are a few backbone views that may or may not render parts of the page depending on the type of update and what is being looked at. For example, some changes to a model result in the view for the collection being re-rendered, and there may or may not be a detail panel view open for the model that's being updated. These model changes can happen very fast, as the server-side workflow involves quite verbose and rapid changes to the model.
Here's the problem: I'm getting a large number of errno 32 broken pipe messages in the webserver's process when sending messages to the client, although the websocket connection is still up and its readyState is still 1 (OPEN).
What I suspect is happening is that the various views haven't finished rendering in the onmessage callback by the time the next message is coming in. After I get these tracebacks in stdout the websocket connection can still work and the UI will still update.
If I put eventlet.sleep(0.02) in the loop that reads model changes off the message queue and sends them on the websocket the broken pipe messages go away, however this isn't a real solution and feels like a nasty hack.
Has anyone had similar problems with the websocket's onmessage function trying to do too much work and still being busy when the next message comes in? Anyone have a solution?
I think the most efficient way to do this is for the client app to tell the server what it is displaying. The server keeps track of this and sends changes only for the objects currently viewed, and only to the concerned clients.
A way to do this is by using a "Who Watches What" list of items.
Items are indexed in two ways: by client ID, and with an isViewedBy chain list inside each data object (I know it doesn't look clean to mix it with the data, but it is very efficient).
You'll also need a lastupdate timestamp for each data object.
When a client changes its view, it sends an "I'm viewing this, of which I have the version -timestamp-" message to the server. The server checks the timestamp and sends back the object if required. It also removes obsolete "Who Watches What" items (accessing them by client ID) and creates the new ones.
When a data object is updated, loop through the isViewedBy chain list of this object to know which clients should be updated. Put this in message buffers for each client and flush those buffers manually (in case you update several items at the same time, it will send one big message).
This is a lot of work, but your app will be efficient and scale gracefully, even with lots of objects and lots of clients. It sends only useful messages, and it is very unlikely that there will be too many of them.
For your onMessage problem, I would store incoming data in a queue and process it asynchronously.
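A bare-bones, in-memory sketch of the "Who Watches What" bookkeeping and per-client buffering described above (all names are illustrative; a real app would also need locking and cleanup on disconnect):

import json
import time
from collections import defaultdict

viewers = defaultdict(set)    # object_id -> client_ids currently viewing it
buffers = defaultdict(list)   # client_id -> object_ids with pending updates
last_update = {}              # object_id -> last modification timestamp

def client_viewing(client_id, object_id, client_timestamp):
    viewers[object_id].add(client_id)
    if last_update.get(object_id, 0) > client_timestamp:
        buffers[client_id].append(object_id)  # client's copy is stale, resend

def object_updated(object_id):
    last_update[object_id] = time.time()
    for client_id in viewers[object_id]:
        buffers[client_id].append(object_id)

def flush(client_id, websocket):
    # One batched send per flush instead of one send per model change.
    pending = buffers.pop(client_id, [])
    if pending:
        websocket.send(json.dumps(pending))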

How can I synchronize access to stored data?

I am using Google App Engine with the Python 2.7 run time. I know it is multi-threaded, but the multi-instance nature of Google App Engine makes this question relevant to any run time.
I have a web messenger application. It is based on the Channel API for receiving various notifications (message received, user connected, user typed) and on the Memcache API for maintaining global state (and on the High Replication Datastore for actually storing the message history).
The time stamp of the last message is kept in Memcache (along with other data, such as whether the user is online or not, typing or not, and others).
Whenever a user sends a message, the Memcache is updated with the new time stamp value, while the old time stamp value is sent using the Channel API.
When multiple messages are sent at the same time, the Memcache value is sometimes overwritten and I get weird values for that time stamp. For example, one message has a certain time stamp and the next has an older time stamp.
I know about the Memcache Client API, but I cannot use that because I must read the most up-to-date time stamp first and then write the new time stamp.
In short, I want to somehow wait for the (Memcache?) data to be completely unoccupied (read or write), lock it somehow, proceed throughout the entire request and then release it for the next request.
Any suggestions?
Thank you in advance. :)
There is no explicit locking functionality in the Memcache API, but you can achieve the effect using the Compare-and-Set mechanism. Guido has a writeup with examples here. Only the Client class exposes CAS, though, but it's unclear from your description why that won't work for you.
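The compare-and-set loop from that writeup looks roughly like this (the key name is illustrative):

from google.appengine.api import memcache

client = memcache.Client()

def update_timestamp(new_ts, key='chat:last_ts'):
    while True:
        old_ts = client.gets(key)            # gets() remembers the cas id
        if old_ts is None:
            if client.add(key, new_ts):      # nothing stored yet
                return None
            continue                         # lost the race, retry
        if new_ts <= old_ts:
            return old_ts                    # someone already stored a newer value
        if client.cas(key, new_ts):          # only succeeds if the value is unchanged
            return old_ts                    # previous time stamp to send over the Channel API
        # cas failed: another request updated the key concurrently; retry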

Recommendation for click/event tracking mechanisms (python, django, celery, mongo etc)

I'm looking into ways to track events in a django application (events would generally be clicks tied to a specific unique user id).
These events would essentially contain an event type like "click" and then each click event would be assigned to a unique id (many events can go to one id) and each event would have a data set including items like referrer etc...
I have tried mixpanel, but for now the data api they are offering seems too limiting as I can't seem to find a way to get all of my data out by a unique id (apart from the event itself).
I'm looking into using django-eventracker, but I'm curious about others' thoughts on the best way to do this. Mongo or CouchDB seem like a great choice here, but celery/rabbitmq looks really attractive with mongo. Pumping these events into the existing application's db seems limiting at this point.
Anyways, this is just a thread to see what others' thoughts are on this and how they have implemented something like this...
shoot
I am not familiar with the pre-packaged solutions you mention. Were I to design this from scratch, I'd have a simple JS collecting info on clicks and posting it back to the server via Ajax (using whatever JS framework you're already using), and on the server side I'd simply append that info to a log file for later "offline" processing -- so that would be independent of django or other server-side framework, essentially.
Appending to a log file is a very lightweight action, while DBs for web use are generally heavily optimized for read-intensive (not write-intensive) operation, so I agree with you that force-fitting that info (as it trickles in) into the existing app's DB is unlikely to offer good performance.
You probably want to keep a flexible format for your logs to anticipate future needs or changes. In this sense, the schema-less document-oriented databases are nice. One advantage is that the structure of your data will be close to your application needs for whatever analyses you perform later (so, avoiding some of the inevitable parsing/data munging work).
If you're thinking about using mysql, postgresql or such, then you should look into something like rsyslog for buffering writes and avoiding the performance penalty with heavy logging. (I can't say much about celery and other queueing mechanisms for this type of thing, but they sound promising.)
MongoDB has some nice features that make it amenable to logging, such as capped collections. A summary can be found in this post.
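A minimal sketch of the Ajax-endpoint-plus-log-file approach in Django (the view name, POST field names, and log path are assumptions):

# views.py
import json
import time
from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt

LOG_PATH = '/var/log/myapp/clicks.log'  # assumed location

@csrf_exempt
def track_click(request):
    event = {
        'ts': time.time(),
        'uid': request.POST.get('uid'),
        'type': request.POST.get('type', 'click'),
        'referrer': request.META.get('HTTP_REFERER', ''),
    }
    # Appending one line is cheap; parsing and aggregation happen offline later.
    with open(LOG_PATH, 'a') as f:
        f.write(json.dumps(event) + '\n')
    return HttpResponse(status=204)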
If by click, you mean a click on a link that loads a new page (or performs an AJAX request), then what you aim to do is fairly straightforward. Web servers tend to keep plain-text logs about requests - with information about the user, time/date, referrer, the page requested, etc. You could examine these logs and mine the statistics you need.
On the other hand, if you have a web application where clicks don't necessarily generate server requests, then collecting click information with javascript is your best bet.
